Commit e78d65b
trainer: add Configure TrainJob Lifecycle user guide (#4348)
* trainer: add Configure TrainJob Lifecycle user guide

  Adds documentation for activeDeadlineSeconds and suspend/resume lifecycle features introduced in kubeflow/trainer#3258.

  Signed-off-by: XploY04 <2004agarwalyash@gmail.com>

* Apply suggestion from @andreyvelich

  Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
  Signed-off-by: Yash Agarwal <2004agarwalyash@gmail.com>

---------

Signed-off-by: XploY04 <2004agarwalyash@gmail.com>
Signed-off-by: Yash Agarwal <2004agarwalyash@gmail.com>
Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
1 parent 564f514 commit e78d65b

File tree: 1 file changed (+113, -0)
+++
title = "Configure TrainJob Lifecycle"
description = "How to configure active deadlines and suspend/resume for TrainJobs"
weight = 80
+++

This guide describes how to configure lifecycle policies for TrainJobs, including active deadlines
to automatically terminate long-running jobs and suspend/resume to pause and restart training.

## Prerequisites

Before exploring this guide, make sure to follow [the Getting Started guide](/docs/components/trainer/getting-started/)
to understand the basics of Kubeflow Trainer.

## Active Deadline Overview

The `activeDeadlineSeconds` field in the TrainJob spec specifies the maximum duration (in seconds)
that a TrainJob is allowed to run before the system automatically terminates it. This behavior
matches the
[Kubernetes Job `activeDeadlineSeconds`](https://kubernetes.io/docs/concepts/workloads/controllers/job/#job-termination-and-cleanup)
semantics.

When the deadline is reached, all running Pods are terminated, the underlying JobSet is deleted,
and the TrainJob status is set to `Failed` with the reason `DeadlineExceeded`.

The deadline timer starts from the TrainJob creation time. If the TrainJob is suspended and then
resumed, the timer resets from the resume time. The field is immutable after creation, and the
minimum allowed value is `1`.
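The deadline semantics above can be sketched as a small check. This is an illustration of the behavior only, not the controller's actual code; the function name and times are hypothetical:

```python
from datetime import datetime, timedelta, timezone

def deadline_exceeded(start_time: datetime, active_deadline_seconds: int,
                      now: datetime) -> bool:
    """Illustrates the deadline check: the timer starts at TrainJob creation,
    or at the last resume if the job was suspended and resumed."""
    if active_deadline_seconds < 1:
        # Mirrors the documented minimum allowed value of 1.
        raise ValueError("activeDeadlineSeconds must be at least 1")
    return now - start_time > timedelta(seconds=active_deadline_seconds)

# A TrainJob created 2 hours ago with a 1-hour deadline is past its deadline.
created = datetime(2025, 1, 1, 10, 0, tzinfo=timezone.utc)
now = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
print(deadline_exceeded(created, 3600, now))  # → True
```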

## Create TrainJob with Active Deadline

You can set `activeDeadlineSeconds` on a TrainJob to enforce a time limit. The following
example creates a TrainJob that is terminated if it runs longer than 1 hour:

```yaml
apiVersion: trainer.kubeflow.org/v1alpha1
kind: TrainJob
metadata:
  name: my-trainjob
  namespace: my-namespace
spec:
  activeDeadlineSeconds: 3600 # terminate the TrainJob after 1 hour
  runtimeRef:
    name: torch-distributed
    kind: ClusterTrainingRuntime
    apiGroup: trainer.kubeflow.org
  trainer:
    image: docker.io/my-training-image:latest
    command:
      - torchrun
      - train.py
```
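The same TrainJob can also be built programmatically, for example as a plain Python dict to submit with a Kubernetes client of your choice. This is a sketch mirroring the manifest above; the image name and namespace are placeholders:

```python
import json

# The TrainJob manifest above, expressed as a Python dict.
trainjob = {
    "apiVersion": "trainer.kubeflow.org/v1alpha1",
    "kind": "TrainJob",
    "metadata": {"name": "my-trainjob", "namespace": "my-namespace"},
    "spec": {
        "activeDeadlineSeconds": 3600,  # terminate the TrainJob after 1 hour
        "runtimeRef": {
            "name": "torch-distributed",
            "kind": "ClusterTrainingRuntime",
            "apiGroup": "trainer.kubeflow.org",
        },
        "trainer": {
            "image": "docker.io/my-training-image:latest",
            "command": ["torchrun", "train.py"],
        },
    },
}
print(json.dumps(trainjob, indent=2))
```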

### Verify the TrainJob Deadline Status

After the deadline is exceeded, the TrainJob transitions to a `Failed` state. Run the following
command to check the TrainJob conditions:

```sh
kubectl get trainjob my-trainjob -o jsonpath='{.status.conditions[?(@.status=="True")]}'
```

You should see a condition like the following:

```json
{
  "type": "Failed",
  "status": "True",
  "reason": "DeadlineExceeded",
  "message": "TrainJob exceeded its active deadline"
}
```
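If you prefer to inspect the status programmatically, a small helper over conditions shaped like the JSON above might look like this (a sketch; the helper name is hypothetical):

```python
def failed_by_deadline(conditions: list[dict]) -> bool:
    """Return True if a Failed=True condition with reason DeadlineExceeded exists."""
    return any(
        c.get("type") == "Failed"
        and c.get("status") == "True"
        and c.get("reason") == "DeadlineExceeded"
        for c in conditions
    )

# The condition shown in the example output above.
conditions = [
    {
        "type": "Failed",
        "status": "True",
        "reason": "DeadlineExceeded",
        "message": "TrainJob exceeded its active deadline",
    }
]
print(failed_by_deadline(conditions))  # → True
```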

## Suspend and Resume TrainJob

The `suspend` field allows you to pause a running TrainJob without deleting it. When a TrainJob
is suspended, its Pods are terminated, but the TrainJob resource and its configuration are preserved.
This is useful for temporarily freeing cluster resources or for debugging training issues.

The following example creates a TrainJob in the suspended state:

```yaml
apiVersion: trainer.kubeflow.org/v1alpha1
kind: TrainJob
metadata:
  name: my-trainjob
  namespace: my-namespace
spec:
  suspend: true
  runtimeRef:
    name: torch-distributed
    kind: ClusterTrainingRuntime
    apiGroup: trainer.kubeflow.org
```

To resume the TrainJob, update the `suspend` field to `false`:

```sh
kubectl patch trainjob my-trainjob --type=merge -p '{"spec":{"suspend":false}}'
```
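The patch payload passed to `kubectl` above is plain JSON, so you can build and sanity-check the same merge patch in Python before sending it with any Kubernetes client (a sketch):

```python
import json

# The merge patch body from the kubectl command above, as Python data.
patch = {"spec": {"suspend": False}}

# It is equivalent to the inline JSON string passed to `kubectl patch`.
assert json.loads('{"spec":{"suspend":false}}') == patch

print(json.dumps(patch))  # → {"spec": {"suspend": false}}
```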

{{% alert title="Note" color="info" %}}
When a TrainJob with `activeDeadlineSeconds` is resumed from suspension, the deadline timer
resets from the resume time, not from the original creation time. This means the TrainJob gets
the full deadline duration after each resume.
{{% /alert %}}
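The reset behavior described in the note can be illustrated with a short calculation (illustrative only; the timestamps are hypothetical):

```python
from datetime import datetime, timedelta, timezone

active_deadline = timedelta(seconds=3600)  # 1-hour deadline

created = datetime(2025, 1, 1, 9, 0, tzinfo=timezone.utc)
resumed = datetime(2025, 1, 1, 11, 30, tzinfo=timezone.utc)  # after a suspension

# The deadline is measured from the resume time, not the creation time,
# so the job gets the full hour again after resuming.
expiry = resumed + active_deadline
print(expiry.isoformat())  # → 2025-01-01T12:30:00+00:00
```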

## Next Steps

- Learn more about [runtime patches](/docs/components/trainer/operator-guides/runtime-patches/)
  for customizing TrainJob behavior.
- Check out [the TrainJob API reference](https://github.com/kubeflow/trainer/blob/master/pkg/apis/trainer/v1alpha1/trainjob_types.go)
  for the full list of `TrainJobSpec` fields.
