+++
title = "XGBoost Guide"
description = "How to run distributed XGBoost on Kubernetes with Kubeflow Trainer"
weight = 20
+++

This guide describes how to use TrainJob to run distributed
[XGBoost](https://xgboost.readthedocs.io/) training on Kubernetes.

---

## Prerequisites

Before exploring this guide, make sure to follow
[the Getting Started guide](/docs/components/trainer/getting-started/)
to understand the basics of Kubeflow Trainer.

---

## XGBoost Distributed Overview

XGBoost supports distributed training through the
[Collective](https://xgboost.readthedocs.io/en/latest/tutorials/kubernetes.html)
communication protocol (historically known as Rabit). In a distributed setting,
multiple worker processes each operate on a shard of the data and synchronize
histogram bin statistics via AllReduce to agree on the best tree splits.

Kubeflow Trainer integrates with XGBoost by:

- Deploying worker pods as a [JobSet](https://github.com/kubernetes-sigs/jobset).
- Automatically injecting the `DMLC_*` environment variables required by XGBoost's
  Collective communication layer (`DMLC_TRACKER_URI`, `DMLC_TRACKER_PORT`,
  `DMLC_TASK_ID`, `DMLC_NUM_WORKER`).
- Providing the rank-0 pod with the tracker address so user code can start a
  `RabitTracker` for worker coordination.
- Supporting both CPU and GPU training workloads.
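
As an illustration, training code running inside a worker pod could read these
injected variables to decide its role. Only the `DMLC_*` variable names come from
the list above; the helper functions and default values below are hypothetical,
a sketch rather than the runtime's actual bootstrap code:

```python
import os


def collective_config():
    """Read the DMLC_* variables Kubeflow Trainer injects into each pod.

    The variable names match the Collective environment variables listed
    above; the fallback defaults are illustrative only.
    """
    return {
        "tracker_host": os.environ.get("DMLC_TRACKER_URI", "localhost"),
        "tracker_port": int(os.environ.get("DMLC_TRACKER_PORT", "9091")),
        "rank": int(os.environ.get("DMLC_TASK_ID", "0")),
        "world_size": int(os.environ.get("DMLC_NUM_WORKER", "1")),
    }


def is_tracker_node(cfg):
    # The rank-0 pod is the one expected to start a RabitTracker
    # so the other workers can coordinate through it.
    return cfg["rank"] == 0
```

Rank-0 code would then start the tracker at `tracker_host:tracker_port`, while
every worker (rank 0 included) connects to it before training begins.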

The built-in runtime is called `xgboost-distributed` and uses the container image
`ghcr.io/kubeflow/trainer/xgboost-runtime:latest`, which includes XGBoost with
CUDA 12 support, NumPy, and scikit-learn.

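As a sketch, a TrainJob referencing this runtime might look like the following.
The field names (`runtimeRef`, `numNodes`, `resourcesPerNode`) are assumed from
the Kubeflow Trainer v2 API, and the node and GPU counts are illustrative:

```yaml
apiVersion: trainer.kubeflow.org/v1alpha1
kind: TrainJob
metadata:
  name: xgboost-gpu-example
spec:
  runtimeRef:
    name: xgboost-distributed   # the built-in runtime described above
  trainer:
    numNodes: 2                 # worker pods in the JobSet
    resourcesPerNode:
      limits:
        nvidia.com/gpu: 1       # GPU count drives workers per node
```
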
### Worker Count

The total number of XGBoost workers is calculated as:

```text
DMLC_NUM_WORKER = numNodes × workersPerNode
```

- **CPU training**: 1 worker per node. Each worker uses OpenMP to parallelize
  across all available CPU cores.
- **GPU training**: 1 worker per GPU. The GPU count is derived from
  `resourcesPerNode` limits in the TrainJob.
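
The rule above can be expressed as a small helper; the function name is
hypothetical, but the arithmetic follows the formula and bullets directly:

```python
def num_workers(num_nodes: int, gpus_per_node: int = 0) -> int:
    """DMLC_NUM_WORKER = numNodes x workersPerNode.

    CPU training runs 1 worker per node; GPU training runs 1 worker per
    GPU, with the GPU count taken from the resourcesPerNode limits.
    """
    workers_per_node = gpus_per_node if gpus_per_node > 0 else 1
    return num_nodes * workers_per_node


# 4 CPU-only nodes -> 4 workers; 2 nodes with 4 GPUs each -> 8 workers.
```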

---

## Further Information

For comprehensive documentation including complete training examples (Python SDK
and kubectl YAML), best practices (`QuantileDMatrix`, early stopping,
checkpointing, logging), and common issues, see the XGBoost documentation:

**[Distributed XGBoost on Kubernetes — XGBoost Tutorial](https://xgboost.readthedocs.io/en/latest/tutorials/kubernetes.html)**