
bug: repair_run triggers full job run instead of repairing failed tasks #392

@sgarla

Description

Summary

When a user asks Claude to perform a task repair (re-run only the failed tasks of a previous job run), the skill triggers a full run_now instead of a repair_run. This is because repair_run is not implemented in the MCP tools.

Steps to Reproduce

  1. Run a Databricks job where one or more tasks fail
  2. Ask Claude Code (with ai-dev-kit) to repair the failed run
  3. Observe that a new full job run is triggered instead of repairing the failed tasks

Expected Behavior

Claude should call jobs.repair_run(run_id=<failed_run_id>, ...) to re-run only the failed/skipped tasks from the original run, preserving the successful task outputs.
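The expected call shape can be illustrated with a stub standing in for the real `databricks.sdk.WorkspaceClient`, so the snippet is self-contained (the stub, the `run_id` value, and the returned dict are illustrative, not the SDK's actual behavior):

```python
class _StubJobsApi:
    """Minimal stand-in that records repair_run calls (not the real SDK)."""
    def __init__(self):
        self.calls = []

    def repair_run(self, run_id, rerun_all_failed_tasks=False, **kwargs):
        # The real SDK method also accepts rerun_tasks, job parameters, etc.
        self.calls.append({"run_id": run_id,
                           "rerun_all_failed_tasks": rerun_all_failed_tasks,
                           **kwargs})
        return {"repair_id": 1}

jobs = _StubJobsApi()
# Re-run only the failed/skipped tasks of run 12345, keeping successful outputs:
result = jobs.repair_run(run_id=12345, rerun_all_failed_tasks=True)
```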

Actual Behavior

Claude falls back to manage_job_runs(action='run_now', job_id=...), which starts a brand new full run from scratch.

Root Cause

The manage_job_runs MCP tool only exposes these actions: run_now, get, get_output, cancel, list, wait.

A repair action is missing. The Databricks SDK supports w.jobs.repair_run(), but no wrapper for it has been implemented in:

  • databricks-mcp-server/databricks_mcp_server/tools/jobs.py
  • databricks-tools-core/databricks_tools_core/jobs/runs.py

Proposed Fix

  1. Add a repair_run() function in databricks_tools_core/jobs/runs.py using w.jobs.repair_run()
  2. Add a repair action to manage_job_runs in databricks_mcp_server/tools/jobs.py
  3. Update SKILL.md to document the repair workflow

Impact

Customers using ai-dev-kit for job orchestration and failure recovery are inadvertently re-running entire jobs, wasting compute and time.
