Commit 37d5fdc
docs: add XGBoost distributed training user guide
Add a user guide for distributed XGBoost training on Kubernetes via
Kubeflow Trainer at content/en/docs/components/trainer/user-guides/xgboost.md.
The guide provides:

- An overview of the XGBoost Collective protocol and how Kubeflow Trainer
  integrates with it (DMLC_* env vars, JobSet, built-in runtime)
- Worker count formula for CPU and GPU training
- A redirect to the comprehensive XGBoost tutorial at
  https://xgboost.readthedocs.io/en/latest/tutorials/kubernetes.html

Signed-off-by: krishna-kg732 <krishnagupta.kg2k6@gmail.com>
1 parent c6d6891 commit 37d5fdc

File tree

1 file changed (+63, −0)
  • content/en/docs/components/trainer/user-guides/xgboost.md
+++
title = "XGBoost Guide"
description = "How to run distributed XGBoost on Kubernetes with Kubeflow Trainer"
weight = 20
+++
This guide describes how to use TrainJob to run distributed
[XGBoost](https://xgboost.readthedocs.io/) training on Kubernetes.

---
## Prerequisites

Before exploring this guide, make sure to follow
[the Getting Started guide](/docs/components/trainer/getting-started/)
to understand the basics of Kubeflow Trainer.

---
## XGBoost Distributed Overview

XGBoost supports distributed training through the
[Collective](https://xgboost.readthedocs.io/en/latest/tutorials/kubernetes.html)
communication protocol (historically known as Rabit). In a distributed setting,
multiple worker processes each operate on a shard of the data and synchronize
histogram bin statistics via AllReduce to agree on the best tree splits.

Kubeflow Trainer integrates with XGBoost by:

- Deploying worker pods as a [JobSet](https://github.com/kubernetes-sigs/jobset).
- Automatically injecting the `DMLC_*` environment variables required by XGBoost's
  Collective communication layer (`DMLC_TRACKER_URI`, `DMLC_TRACKER_PORT`,
  `DMLC_TASK_ID`, `DMLC_NUM_WORKER`).
- Providing the rank-0 pod with the tracker address so user code can start a
  `RabitTracker` for worker coordination.
- Supporting both CPU and GPU training workloads.
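As a rough illustration of the worker side, user code might gather these injected variables before initializing XGBoost's Collective layer. This helper is a hypothetical sketch (its name and error handling are not part of the Trainer or XGBoost APIs); only the four `DMLC_*` variable names come from the description above.

```python
import os

def read_collective_config(env=None):
    """Collect the DMLC_* variables that Kubeflow Trainer injects into each pod.

    Hypothetical helper: the variable names are the ones documented above,
    everything else is illustrative.
    """
    env = os.environ if env is None else env
    keys = ("DMLC_TRACKER_URI", "DMLC_TRACKER_PORT",
            "DMLC_TASK_ID", "DMLC_NUM_WORKER")
    missing = [k for k in keys if k not in env]
    if missing:
        raise RuntimeError(f"missing Trainer-injected variables: {missing}")
    return {k: env[k] for k in keys}
```

In a real worker these values would be handed to XGBoost's collective initialization; consult the linked tutorial for the exact calls.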
The built-in runtime is called `xgboost-distributed` and uses the container image
`ghcr.io/kubeflow/trainer/xgboost-runtime:latest`, which includes XGBoost with
CUDA 12 support, NumPy, and scikit-learn.
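For orientation, a minimal TrainJob referencing this runtime might look like the sketch below. Treat the exact schema as an assumption (the API group/version and field names follow the Kubeflow Trainer v2 API, but check the Getting Started guide for the authoritative spec); the job name is hypothetical.

```yaml
apiVersion: trainer.kubeflow.org/v1alpha1
kind: TrainJob
metadata:
  name: xgboost-dist            # hypothetical name
spec:
  runtimeRef:
    name: xgboost-distributed   # the built-in runtime described above
  trainer:
    numNodes: 2                 # worker pods deployed via JobSet
```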
### Worker Count

The total number of XGBoost workers is calculated as:

```text
DMLC_NUM_WORKER = numNodes × workersPerNode
```
- **CPU training**: 1 worker per node. Each worker uses OpenMP to parallelize
  across all available CPU cores.
- **GPU training**: 1 worker per GPU. The GPU count is derived from
  `resourcesPerNode` limits in the TrainJob.
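The formula and the per-mode rules above can be condensed into a small helper. This is an illustrative sketch: `num_workers` and its parameters are hypothetical names, not part of any Trainer API.

```python
def num_workers(num_nodes: int, gpus_per_node: int = 0) -> int:
    """Compute DMLC_NUM_WORKER as described in this guide.

    CPU training: 1 worker per node (workersPerNode = 1).
    GPU training: 1 worker per GPU (workersPerNode = gpus_per_node).
    """
    workers_per_node = gpus_per_node if gpus_per_node > 0 else 1
    return num_nodes * workers_per_node
```

For example, a 4-node CPU job yields `num_workers(4) == 4`, while a 2-node job with 8 GPUs per node yields `num_workers(2, gpus_per_node=8) == 16`.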
---

## Further Information

For comprehensive documentation including complete training examples (Python SDK
and kubectl YAML), best practices (`QuantileDMatrix`, early stopping,
checkpointing, logging), and common issues, see the XGBoost documentation:

**[Distributed XGBoost on Kubernetes — XGBoost Tutorial](https://xgboost.readthedocs.io/en/latest/tutorials/kubernetes.html)**
