Summary
When a user asks Claude to perform a task repair (re-run only the failed tasks of a previous job run), the skill triggers a full run_now instead of a repair_run. This happens because repair_run is not implemented in the MCP tools, so no repair action is available to call.
Steps to Reproduce
- Run a Databricks job where one or more tasks fail
- Ask Claude Code (with ai-dev-kit) to repair the failed run
- Observe that a new full job run is triggered instead of repairing the failed tasks
Expected Behavior
Claude should call jobs.repair_run(run_id=<failed_run_id>, ...) to re-run only the failed/skipped tasks from the original run, preserving the successful task outputs.
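For illustration, the expected call could look roughly like the sketch below. The helper name repair_failed_tasks is hypothetical and not part of ai-dev-kit. It assumes w is a databricks.sdk.WorkspaceClient (or any object exposing a compatible jobs.repair_run), and that rerun_all_failed_tasks and latest_repair_id behave as described in the Jobs API repair documentation.

```python
from typing import Any, Optional


def repair_failed_tasks(
    w: Any, run_id: int, latest_repair_id: Optional[int] = None
) -> Any:
    """Re-run only the failed/skipped tasks of an existing job run.

    Hypothetical helper: `w` is assumed to be a databricks.sdk.WorkspaceClient
    or a stand-in with a compatible `jobs.repair_run` method.
    """
    kwargs = {"rerun_all_failed_tasks": True}
    if latest_repair_id is not None:
        # Assumption from the Jobs API docs: repairing the same run again
        # requires the ID of the most recent repair.
        kwargs["latest_repair_id"] = latest_repair_id
    # The original run_id is reused, so successful task outputs are kept.
    return w.jobs.repair_run(run_id=run_id, **kwargs)
```

The key difference from run_now is that the call is keyed on the run_id of the failed run rather than on the job_id, so no new run is created.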
Actual Behavior
Claude falls back to manage_job_runs(action='run_now', job_id=...), which starts a brand new full run from scratch.
Root Cause
The manage_job_runs MCP tool only exposes these actions: run_now, get, get_output, cancel, list, wait.
A repair action is missing. The Databricks SDK supports w.jobs.repair_run() but it has not been implemented in:
- databricks-mcp-server/databricks_mcp_server/tools/jobs.py
- databricks-tools-core/databricks_tools_core/jobs/runs.py
Proposed Fix
- Add a repair_run() function in databricks_tools_core/jobs/runs.py using w.jobs.repair_run()
- Add a repair action to manage_job_runs in databricks_mcp_server/tools/jobs.py
- Update SKILL.md to document the repair workflow
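A minimal sketch of how the first two steps might fit together is shown below. It is not the actual ai-dev-kit code: the function names, the return shape, and the dispatch table are assumptions made for illustration, and only the run_now and repair actions are shown. It assumes w is a databricks.sdk.WorkspaceClient and that the object returned by w.jobs.repair_run() and w.jobs.run_now() carries the API response in a response attribute.

```python
from typing import Any, Callable, Dict, List, Optional


def repair_run(
    w: Any,
    run_id: int,
    rerun_tasks: Optional[List[str]] = None,
    latest_repair_id: Optional[int] = None,
) -> Dict[str, Any]:
    """Hypothetical core function: repair an existing run instead of re-running the job."""
    kwargs: Dict[str, Any] = {}
    if rerun_tasks:
        # Repair only the named task keys.
        kwargs["rerun_tasks"] = list(rerun_tasks)
    else:
        # Default: repair every failed task in the run.
        kwargs["rerun_all_failed_tasks"] = True
    if latest_repair_id is not None:
        kwargs["latest_repair_id"] = latest_repair_id
    waiter = w.jobs.repair_run(run_id=run_id, **kwargs)
    # Assumption: the returned waiter exposes the raw response as `.response`.
    response = getattr(waiter, "response", None)
    return {"run_id": run_id, "repair_id": getattr(response, "repair_id", None)}


def run_now(w: Any, job_id: int) -> Dict[str, Any]:
    """Existing behaviour: start a brand new full run of the job."""
    waiter = w.jobs.run_now(job_id=job_id)
    response = getattr(waiter, "response", None)
    return {"job_id": job_id, "run_id": getattr(response, "run_id", None)}


# Hypothetical dispatch table; the real tool also has get, get_output,
# cancel, list and wait, which are omitted here.
_ACTIONS: Dict[str, Callable[..., Dict[str, Any]]] = {
    "run_now": run_now,
    "repair": repair_run,
}


def manage_job_runs_sketch(w: Any, action: str, **params: Any) -> Dict[str, Any]:
    """Route an action name to its handler, rejecting unknown actions."""
    handler = _ACTIONS.get(action)
    if handler is None:
        raise ValueError(
            f"Unknown action {action!r}; expected one of {sorted(_ACTIONS)}"
        )
    return handler(w, **params)
```

With a repair entry in the dispatch table, a request such as manage_job_runs_sketch(w, "repair", run_id=<failed_run_id>) reaches repair_run() rather than falling back to run_now.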
Impact
Customers using ai-dev-kit for job orchestration and failure recovery are inadvertently re-running entire jobs, wasting compute and time.