Skip to content

fix(rayjob): use grace period env var for submitter finished timeout#8

Open
jld-adriano wants to merge 1 commit intomasterfrom
devin/1774985581-fix-submitter-grace-period
Open

fix(rayjob): use grace period env var for submitter finished timeout#8
jld-adriano wants to merge 1 commit intomasterfrom
devin/1774985581-fix-submitter-grace-period

Conversation

@jld-adriano
Copy link
Copy Markdown

@jld-adriano jld-adriano commented Mar 31, 2026

Why are these changes needed?

checkSubmitterFinishedTimeoutAndUpdateStatusIfNeeded uses a hardcoded 30-second timeout (DefaultSubmitterFinishedTimeout) to decide how long to wait after the submitter pod finishes before failing the RayJob if it hasn't reached a terminal state. On non-AWS nodes (e.g. Mithril), model downloads from S3 cause Ray head startup to take longer than 30s, so the job is prematurely marked as failed with JobDeploymentStatusTransitionGracePeriodExceeded.

There is already a companion function, checkTransitionGracePeriodAndUpdateStatusIfNeeded, that reads the RAYJOB_DEPLOYMENT_STATUS_TRANSITION_GRACE_PERIOD_SECONDS env var for its own grace period. This PR makes the submitter-finished timeout also read that same env var, falling back to the 30s default when it's unset.

Key review points:

  • Both grace period checks now share a single env var. This is intentional (one knob to configure), but reviewers should confirm there's no scenario where different values are needed for the two timeouts.
  • os.Getenv is called on each reconciliation rather than cached at startup — this matches the existing pattern in checkTransitionGracePeriodAndUpdateStatusIfNeeded.

Related issue number

N/A — discovered during Flyte integration testing on Mithril GPU nodes where the 30s hardcoded timeout was too short.

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

Manually validated by running a Flyte integration test that exercises KubeRay RayJob submission on a Mithril GPU node where head pod startup exceeds 30s.

Link to Devin session: https://app.devin.ai/sessions/e25862cfa13341aea414ccbec3fb65af
Requested by: @jld-adriano


Open with Devin

…ONDS env var for submitter finished timeout

The checkSubmitterFinishedTimeoutAndUpdateStatusIfNeeded function used a hardcoded
30s timeout (DefaultSubmitterFinishedTimeout) for the grace period between when the
submitter pod finishes and when the Ray job reaches terminal state. This was too
short for workloads on non-AWS nodes (e.g. Mithril) where model downloads from S3
take longer.

This change makes the function read the existing
RAYJOB_DEPLOYMENT_STATUS_TRANSITION_GRACE_PERIOD_SECONDS env var (same one used by
checkTransitionGracePeriodAndUpdateStatusIfNeeded), falling back to the 30s default
if not set. This allows operators to configure a single knob for both grace periods.

Co-Authored-By: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>
@devin-ai-integration
Copy link
Copy Markdown

🤖 Devin AI Engineer

I'll be helping with this pull request! Here's what you should know:

✅ I will automatically:

  • Address comments on this PR. Add '(aside)' to your comment to have me ignore it.
  • Look at CI failures and help fix them

Note: I can only respond to comments from users who have write access to this repository.

⚙️ Control Options:

  • Disable automatic comment and CI monitoring

Copy link
Copy Markdown

@devin-ai-integration devin-ai-integration bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no potential bugs to report.

View in Devin Review to see 2 additional findings.

Open in Devin Review

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant