Kacper0199/Benchmark-Triton-vLLM

 
 


vLLM and Triton Inference Server Benchmark on Nvidia GH200

This repository contains scripts to benchmark and compare the performance of vLLM and Nvidia Triton Inference Server (with TensorRT-LLM backend). The benchmarks serve the model on PLGrid Nvidia GH200 nodes.

1. Prerequisites

Ensure you have access to the PLGrid infrastructure with GH200 partitions. You need valid API keys for the following services:

  1. Hugging Face Token: Required to download the model weights. Get it from your Hugging Face Settings.
  2. Nvidia NGC API Key: Required to pull the Triton Docker image. Get it from the Nvidia NGC Setup.

2. Configuration

Create a file named .env in the main project directory. Copy the content below and fill in your specific values.

# API Keys
HF_TOKEN="hf_..."
NGC_API_KEY="nvapi-..."

# Project's main directory
PROJECT_DIR=/absolute/path/to/your/repository

# Triton Inference Server + TensorRT-LLM specific configurations
IMAGE=docker://nvcr.io/nvidia/tritonserver:25.11-trtllm-python-py3
SHOULD_CLONE_TRTLLM=false
TRTLLM_REPO_URL=https://github.com/NVIDIA/TensorRT-LLM
TRTLLM_REPO_TAG=v1.0.0
TRITON_START_SCRIPT=triton_start.sh
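
Before submitting any job, it can help to confirm that the .env file is complete. The following is a minimal sketch, not part of the repository: the helper names parse_env and missing_keys are hypothetical, and the choice of which keys count as required is an assumption based on the template above.

```python
# Hypothetical pre-flight check for the .env file described above.
# Assumption: these four keys must be set; the repository scripts may need more or fewer.
REQUIRED_KEYS = ("HF_TOKEN", "NGC_API_KEY", "PROJECT_DIR", "IMAGE")


def parse_env(text: str) -> dict[str, str]:
    """Parse KEY=VALUE lines, skipping blank lines and '#' comments."""
    values: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes, as in HF_TOKEN="hf_..."
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(values: dict[str, str]) -> list[str]:
    """Return the required keys that are absent or empty."""
    return [key for key in REQUIRED_KEYS if not values.get(key)]
```

Usage: read the file with pathlib.Path(".env").read_text(), pass the text to parse_env, and stop if missing_keys returns a non-empty list.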

3. Building the Engine

You must build the TensorRT-LLM engine and prepare the Triton model repository before running any benchmarks. This step needs to be performed only once.

Submit the build job:

sbatch build.sbatch

This process performs the following actions:

  1. Downloads the Qwen 2.5 32B Instruct model from Hugging Face.*
  2. Downloads the TensorRT-LLM source code.*
  3. Converts model checkpoints to TensorRT format.*
  4. Builds the optimized engine in engines/qwen_ckpt.
  5. Generates the Triton configuration in triton_model_repo.

Check build_output.out and build_error.err to confirm the process completed successfully.

*Skipped if already completed in a previous run.

4. Benchmarking vLLM

This step runs the vLLM server and executes the client benchmark. The script automatically handles the Python virtual environment creation and dependency installation.

Submit the vLLM benchmark job:

sbatch benchmark_vllm.sbatch

The script will:

  1. Set up a Python venv in .venv if missing.
  2. Start the vLLM OpenAI-compatible API server on port 8000.
  3. Wait for the server to be ready.
  4. Run the client sweep.py script.
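
The "wait for the server to be ready" step can be sketched as a simple TCP poll. This is an illustration only, not the logic used in benchmark_vllm.sbatch: the function name wait_for_port and the timeout values are assumptions.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout_s: float = 600.0, interval_s: float = 2.0) -> bool:
    """Poll until a TCP connection to host:port succeeds or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval_s):
                return True
        except OSError:
            # Connection refused or timed out: the server is not up yet.
            time.sleep(interval_s)
    return False


# Example: block until the server on port 8000 accepts connections.
# ready = wait_for_port("127.0.0.1", 8000)
```

An open port only shows that the process is listening. If the server exposes a health endpoint, polling it is a stronger readiness signal, since model loading can continue after the socket opens.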

5. Benchmarking Triton Inference Server

This step runs the Triton Inference Server inside an Apptainer container and executes the client benchmark on the host. It isolates the container environment to prevent library conflicts.

Submit the Triton benchmark job:

sbatch benchmark_triton_tensorrtllm.sbatch

The script will:

  1. Clean host environment variables.
  2. Start the Triton container using the triton_start.sh wrapper.
  3. Wait for the server to be ready on port 8001.
  4. Load necessary modules and activate the client venv.
  5. Run the client sweep.py script.

6. Benchmark Parameters

You can customize the benchmark setup by editing the benchmark_code/combos.yaml file.

The benchmark engine executes a test run for every combination of the defined input_output pairs and concurrency_request-rate_prompts tuples. For example, if you define 2 input/output pairs and 3 concurrency configurations, the system will perform 6 separate benchmark runs.
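
The run count is the Cartesian product of the two lists. A minimal sketch with illustrative values follows; the dictionary field names are assumptions, not the names used by sweep.py.

```python
from itertools import product

# Illustrative values: 2 input/output pairs and 3 load configurations.
input_output = [[128, 1024], [1024, 128]]
concurrency_request_rate_prompts = [[16, 200, 200], [64, 200, 200], [128, 20, 400]]

# One benchmark run per (input/output pair, load tuple) combination.
runs = [
    {
        "input_tokens": inp,
        "output_tokens": out,
        "max_concurrency": conc,
        "request_rate": rate,
        "num_prompts": prompts,
    }
    for (inp, out), (conc, rate, prompts) in product(input_output, concurrency_request_rate_prompts)
]

print(len(runs))  # 6
```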

Configuration Parameters

  • input_output: A list of pairs defining the token counts: [input_tokens, output_tokens].
  • concurrency_request-rate_prompts: A list of tuples controlling the load:
    • Max Concurrency: The maximum number of concurrent requests allowed on the server.
    • Request Rate: The average number of requests sent to the server per second (request arrivals follow a Poisson process).
    • Number of Prompts: The total number of requests to send in this run.
  • goodput: Performance targets (SLO) for the benchmark:
    • ttft: Time To First Token target (in ms).
    • tpot: Time Per Output Token target (in ms).
  • model_params: The number of model parameters (in billions). Ensure this matches your deployed model.
  • precision_bytes: Bytes per parameter (typically 2 for FP16/BF16).
  • gpu_specs: Hardware specifications used for calculations. Note: Verify these values against your specific hardware (e.g., H100, GH200).
    • flops: Theoretical FLOPS performance (e.g., 9.9e14 for ~990 TFLOPS).
    • bandwidth: Memory bandwidth in GB/s.
  • random_range_ratio: Defines the variability of input/output lengths. Must be in the range [0, 1).
    • Example: 0.02 means lengths will vary by ±2%.
    • Formula: [length * (1 - range_ratio), length * (1 + range_ratio)].
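
The random_range_ratio formula can be checked with a few lines of Python. This is a sketch under assumptions: the function names are hypothetical, and rounding the bounds inward to whole tokens is a choice made here, not necessarily what the benchmark client does.

```python
import math
import random


def length_bounds(length: int, range_ratio: float) -> tuple[float, float]:
    """Return [length * (1 - range_ratio), length * (1 + range_ratio)]."""
    if not 0 <= range_ratio < 1:
        raise ValueError("range_ratio must be in the range [0, 1)")
    return length * (1 - range_ratio), length * (1 + range_ratio)


def sample_length(length: int, range_ratio: float, rng: random.Random) -> int:
    """Draw a whole-token length uniformly from the bounds (rounded inward)."""
    low, high = length_bounds(length, range_ratio)
    return rng.randint(math.ceil(low), math.floor(high))


low, high = length_bounds(1024, 0.02)
print(round(low, 2), round(high, 2))  # 1003.52 1044.48
```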

Example Configuration

input_output:
  - [128, 1024]
  - [1024, 128]

concurrency_request-rate_prompts:
  # Format: [max_concurrency, request_rate, number_of_prompts]
  - [16, 200, 200]
  - [64, 200, 200]
  - [128, 20, 400]

goodput:
  ttft: 1000    # Target: 1000 ms (1s)
  tpot: 50      # Target: 50 ms

# Model and GPU configuration
model_params: 32      # 32 billion parameters
precision_bytes: 2    # FP16/BF16

# GPU Specs (Example values for 1x GH200 - adjust as needed)
gpu_specs:
  flops: 9.9e14       # Adjust based on your GPU count and type
  bandwidth: 3746.51

# Dataset config
random_range_ratio: 0.02
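
To illustrate how the goodput targets might be applied, the sketch below counts only requests that meet both the ttft and tpot targets and divides by the run duration. This follows a common definition of goodput; the source does not show how the benchmark computes it, so the RequestMetrics type and the goodput function are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RequestMetrics:
    """Per-request latency measurements (hypothetical structure)."""
    ttft_ms: float  # time to first token
    tpot_ms: float  # time per output token


def goodput(requests: list[RequestMetrics], duration_s: float,
            ttft_target_ms: float = 1000.0, tpot_target_ms: float = 50.0) -> float:
    """Requests per second that satisfy both SLO targets."""
    if duration_s <= 0:
        raise ValueError("duration_s must be positive")
    good = sum(
        1 for r in requests
        if r.ttft_ms <= ttft_target_ms and r.tpot_ms <= tpot_target_ms
    )
    return good / duration_s
```

With the example targets (ttft: 1000, tpot: 50), a request with a 1200 ms TTFT does not count toward goodput even if its TPOT is well under 50 ms.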

7. Scaling to Multiple GPUs (Tensor Parallelism)

The default configuration targets a single Nvidia GH200 GPU (TP=1), which is sufficient for models up to 32B parameters (in FP16) thanks to its large 96GB VRAM. However, to maximize throughput (large batch sizes) or support larger models, you can enable Tensor Parallelism to distribute the model across multiple GPUs. The steps below use TP=2 on 2 GPUs.
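
The single-GPU claim follows from simple arithmetic on the weight size. The sketch below is a rough estimate only: it counts model weights and ignores KV cache, activations, and runtime overhead, and it treats 96GB as decimal gigabytes. The function names are hypothetical.

```python
def weight_memory_gb(params_billions: float, bytes_per_param: float) -> float:
    """Weight footprint in GB: (params_billions * 1e9 * bytes_per_param) / 1e9."""
    return params_billions * bytes_per_param


def weights_fit(params_billions: float, bytes_per_param: float,
                gpu_memory_gb: float = 96.0, num_gpus: int = 1) -> bool:
    """True if the weights alone fit in the combined GPU memory."""
    return weight_memory_gb(params_billions, bytes_per_param) < gpu_memory_gb * num_gpus


print(weight_memory_gb(32, 2))  # 64.0
print(weights_fit(32, 2))       # True
```

A 32B model in FP16 needs about 64 GB for weights, which leaves roughly 32 GB on one GPU for the KV cache and everything else. That remaining headroom is what limits batch size and context length at TP=1.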

To switch from 1 GPU to 2 GPUs, you must update the following files:

1. Update Resource Allocation (*.sbatch files): Request 2 GPUs in all SLURM batch scripts:

  • build.sbatch
  • benchmark_vllm.sbatch
  • benchmark_triton_tensorrtllm.sbatch

Change directives:

#SBATCH --gres=gpu:2
#SBATCH --cpus-per-task=144
#SBATCH --mem=240G

2. Update the GPU Count (configs/common.yaml): Set num_gpus: 2.

Performance Note: On a single GH200 node, using TP=1 typically offers lower latency for small batch sizes because it avoids inter-GPU communication overhead. TP=2 is recommended when targeting maximum throughput with very large batch sizes (e.g., 512+) or long context windows, leveraging the combined VRAM of 2 GPUs (~180GB).

8. System Configuration

The benchmarking system relies on a hierarchical structure of configuration files located in the configs/ folder. This allows separating general settings (hardware/global) from model-specific parameters.

8.1. Main File (configs/common.yaml)

This is the starting point for every experiment. It defines:

  • Model: Model name and HF repository path (model_name, hf_repo).
  • Resources: Number of GPUs (num_gpus).
  • Global limits: Max batch size and sequence lengths (target_max_batch_size, target_max_seq_len), which are propagated to other configurations during environment preparation.

8.2. Model-Specific Configurations

Each model has its dedicated folder in configs/models/<Model-Name>/, containing three key files:

  1. build_config.yaml (Triton/TensorRT-LLM Only):
    • Defines engine compilation parameters.
    • Here you set: parallelism (tp_size, pp_size), quantization (e.g., use_weight_only, weight_only_precision), and TRT plugins.
  2. triton_config.yaml (Triton Only):
    • Defines Triton server runtime parameters.
    • Here you set: batching strategies (inflight_fused_batching), KV Cache management (kv_cache_free_gpu_mem_fraction), and scheduling (guaranteed_no_evict vs max_utilization).
  3. vllm_config.yaml (vLLM Only):
    • Defines startup flags for the vLLM server.
    • Here you set: memory optimizations, Prefix Caching, Chunked Prefill, and LoRA parameters.
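
The per-model layout described above can be checked programmatically before submitting a job. This is a sketch, assuming the three file names listed in this section; the helper names model_config_paths and missing_configs are hypothetical and not part of the repository.

```python
from pathlib import Path

# The three per-model files named in this section.
MODEL_CONFIG_FILES = ("build_config.yaml", "triton_config.yaml", "vllm_config.yaml")


def model_config_paths(configs_dir: str, model_name: str) -> dict[str, Path]:
    """Map each config file name to its expected path under configs/models/<Model-Name>/."""
    model_dir = Path(configs_dir) / "models" / model_name
    return {name: model_dir / name for name in MODEL_CONFIG_FILES}


def missing_configs(configs_dir: str, model_name: str) -> list[str]:
    """Return the config file names that do not exist on disk."""
    paths = model_config_paths(configs_dir, model_name)
    return [name for name, path in paths.items() if not path.is_file()]


# Example: missing_configs("configs", "Qwen2.5-32B-Instruct")
```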

8.3. Templates and Documentation (configs/models/examples/)

The examples/ folder (e.g., Llama/, Qwen/ subdirectories) contains template files.

  • These contain full descriptions of all available parameters in the form of comments.
  • They serve as documentation – if you don't know what a specific flag does, check it in the file within examples.
  • They contain default values that are safe for most configurations.

8.4. How to Use?

  1. Select Model: In common.yaml, set model_name to the folder name in configs/models/ (e.g., Qwen2.5-32B-Instruct).
  2. Tuning: Go to the selected model's folder and edit the appropriate .yaml files to change parameters (e.g., increase gpu_memory_utilization).
  3. Applying Changes:

Important

Rebuild Required for Triton: For TensorRT-LLM (Triton), changing most parameters in build_config.yaml (and some in triton_config.yaml that affect memory structure) requires rebuilding the engine.

After changing the configuration, run: sbatch build.sbatch

  • For vLLM: Changes in vllm_config.yaml are applied with every new benchmark run (benchmark_vllm.sbatch) – a rebuild is not required.

9. Results

Benchmark results are saved in the benchmark_code directory.

  • vLLM Results: Located in benchmark_code/results_vllm.
  • Triton Results: Located in benchmark_code/results_triton.

Each folder contains individual JSON files for every test case and a summary aggregate_results.csv file.
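
The summary file can be loaded with the standard library for a quick look at the results. This is a minimal sketch: the column names inside aggregate_results.csv are not documented here, so the code makes no assumptions about them and returns each row as a dictionary keyed by the CSV header. The function names are hypothetical.

```python
import csv
from pathlib import Path


def load_summary(results_dir: str) -> list[dict[str, str]]:
    """Read aggregate_results.csv into a list of row dictionaries."""
    summary_path = Path(results_dir) / "aggregate_results.csv"
    with summary_path.open(newline="") as handle:
        return list(csv.DictReader(handle))


def list_case_files(results_dir: str) -> list[Path]:
    """Return the per-test-case JSON files in sorted order."""
    return sorted(Path(results_dir).glob("*.json"))


# Example: rows = load_summary("benchmark_code/results_vllm")
```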

10. NSYS profile

You can generate an Nsight Systems report from a benchmark run. Set nsys_avail: true in configs/common.yaml. The report is saved in the same directory as the results (see the previous section). The output file name has the format vllm_profile_<jobID>.

11. Cleaning Up

If you need to restart the build process from scratch or clear disk space, remove the generated directories using the following command:

rm -rf TensorRT-LLM tensorrtllm_backend triton_model_repo engines pip_packages model_weights

Notes and Conclusions

  • Interface Comparison: Comparisons between HTTP and gRPC interfaces showed no significant differences in performance.
  • CPU Allocation (important): Always set the #SBATCH --cpus-per-task=72 flag. Otherwise, the scheduler allocates only a single core instead of the full 72-core CPU.
  • Node Exclusivity: Enabling the exclusive flag yields no major changes. The test cases indicate that other tasks running on the node do not have a significant impact on benchmark processing.

  • Cache Configuration: It is necessary to disable prefix caching (vLLM) or KV cache reuse (Triton). Otherwise, when processing multiple scenarios (e.g., varying concurrency) within a single run, subsequent benchmarks would utilize the same set of input prompts. Prefix caching would leverage the stored KV cache and skip the prefill phase. While beneficial in a production environment, this distorts benchmark results (subsequent runs become artificially faster due to caching).

  • Chunk Settings: Default values were retained for chunk_prefill (vLLM) and chunked_context (Triton), with a chunk size of 8192.

  • Input Randomization: A slight randomness (2%) was introduced to the input and output lengths to better simulate real-world system behavior. With identical lengths, the decode phase duration is uniform for all prompts, so the requests in the next batch all start processing at the same moment. This causes the prefill phases to pile up at a single point in time, which is an edge case rather than standard operation. Introducing randomness can significantly reduce TTFT, although it slightly degrades overall latency and throughput.

About

Performance Benchmark for NVIDIA Triton Inference Server (with TensorRT-LLM backend) on PLGrid Helios supercomputer with NVIDIA GH200 nodes
