
Commit 3c3ee2f

Quentin Ambard, Claude, and calreynolds authored
Improve code execution and synthetic data generation skills (#403)
* Enhance AI/BI dashboard skill with comprehensive widget specs

  Added missing documentation from production dashboard generation:

  1-widget-specifications.md:
  - Combo charts (bar + line on same widget) with version 1
  - Counter number formatting (currency, percent, plain number)
  - Widget name max length (60 characters)
  - Color scale restrictions (no scheme/colorRamp/mappings)
  - Quantitative color encoding for gradient effects
  - Bar chart group vs stacked decision criteria with examples

  2-filters.md:
  - Date range picker complete example
  - Multi-dataset filter binding (one query per dataset)
  - Global filter performance note (auto WHERE clause)

  SKILL.md:
  - ORDER BY guidance for time series and rankings

  🤖 Generated with [Claude Code](https://claude.com/claude-code)
  Co-Authored-By: Claude <noreply@anthropic.com>

* Add TOP-N + Other bucketing guidance for high-cardinality dimensions

  When a dimension has too many values (50+ stores, products, etc.), charts become unreadable. Added guidance to:
  - Check cardinality via get_table_details before charting
  - Use TOP-N + "Other" SQL pattern to bucket low-value items
  - Aggregate to higher abstraction level as alternative
  - Use table widgets for high-cardinality data

* Remove async deploy_dashboard function for consistency

  The codebase doesn't use async anywhere else, so remove the unused async version of deploy_dashboard and keep only the synchronous one.
  - Remove asyncio import
  - Remove async deploy_dashboard function (was using asyncio.to_thread)
  - Rename deploy_dashboard_sync to deploy_dashboard
  - Update exports in __init__.py

* Add genie_space_id parameter to dashboard creation

  Allow linking a Genie space to a dashboard by passing genie_space_id. This enables the "Ask Genie" button on the dashboard UI.
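The Genie-space linking just described could be sketched as the following injection helper. The `uiSettings.genieSpace` shape (`isEnabled`, `enablementMode`) comes from the commit message; the function name and the `id` field are illustrative assumptions, not the actual implementation.

```python
import json


def link_genie_space(serialized_dashboard: str, genie_space_id: str) -> str:
    """Hypothetical helper: inject a Genie space config into a dashboard JSON."""
    dashboard = json.loads(serialized_dashboard)
    # Shape per the commit description: uiSettings.genieSpace with
    # isEnabled=true and enablementMode=ENABLED. The "id" key is an assumption.
    ui_settings = dashboard.setdefault("uiSettings", {})
    ui_settings["genieSpace"] = {
        "id": genie_space_id,
        "isEnabled": True,
        "enablementMode": "ENABLED",
    }
    return json.dumps(dashboard)
```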
  The Genie space config is injected into the serialized_dashboard JSON under uiSettings.genieSpace with isEnabled=true and enablementMode=ENABLED.

* Add catalog and schema parameters to dashboard creation

  Allow setting default catalog and schema for dashboard datasets via the dataset_catalog and dataset_schema API parameters. These defaults apply to unqualified table names in SQL queries.

* Add comprehensive date range filtering documentation

  - Document field-based filtering (automatic IN_RANGE on date fields)
  - Document parameter-based filtering (:date_range.min/max in SQL)
  - Show how to combine both approaches in one filter
  - Add guidance on when NOT to apply date filtering (MRR, all-time totals)
  - Update SKILL.md tools table with new genie_space_id, catalog, schema params

* Restructure AI/BI dashboard skill with improved organization

  - Split widget specs into basic (1-widget-specifications.md) and advanced (2-advanced-widget-specifications.md) files
  - Add area chart, scatter plot, combo chart, and choropleth map documentation
  - Rename files for consistent numbering (3-filters, 4-examples, 5-troubleshooting)
  - Remove duplicate information across files (versions, naming rules, etc.)
  - Add widget display formatting guidance (currency, percentage, displayName)
  - Simplify SKILL.md quality checklist with link to version table
  - Shorten verbose examples while preserving all critical information
  - Clarify query naming convention for charts vs filters

* Add back critical behavioral instructions for text widgets and filters

* Restore important behavioral instructions removed during restructure

  - Counter: full Pattern 2 example with CRITICAL field name matching note
  - Table: disaggregated:true guidance and bold emphasis
  - Line/Bar: x,y,color encodings and disaggregated guidance
  - Pie: 3-8 category limit for readability

* Restore detailed guidance that was removed during restructure

  - 5-troubleshooting.md: Restore full troubleshooting content with version guidance, filter debugging, and detailed error explanations
  - SKILL.md: Restore full 10-item quality checklist
  - SKILL.md: Restore standard dashboard structure example
  - SKILL.md: Restore cardinality guidance table (with softer 'suggested' language)

* Optimize MCP tool docstrings for token efficiency

  - Reduce docstring verbosity across all 18 tool files (~89% reduction)
  - Keep all functional information while being concise
  - Add skill references to complex tools (dashboards, vector search, genie, jobs, pipelines, lakebase, unity catalog, serving, apps, agent bricks)
  - Maintain human readability with bullet points and structure
  - Preserve critical warnings (ASK USER FIRST, CONFIRM WITH USER)
  - Keep return format hints for AI parsing

  Net reduction: 1,843 lines across 18 files

* Add parameter context for ambiguous docstring params

  - agent_bricks.py: Add context for description, instructions, volume_path, examples
  - genie.py: Add context for table_identifiers, description, sample_questions, serialized_space
  - jobs.py: Add context for tasks, job_clusters, environments, schedule, git_source
  - lakebase.py: Add context for source_branch, ttl_seconds, is_protected, autoscaling params, and sync source/target table names
  - pipelines.py: Add context for root_path, workspace_file_paths, extra_settings, full_refresh

* Consolidate MCP tools from 77 to 44 (43% reduction)

  Tool consolidations:
  - pipelines.py: 10→2 (manage_pipeline, manage_pipeline_run)
  - volume_files.py: 6→1 (manage_volume_files)
  - aibi_dashboards.py: 4→1 (manage_dashboard)
  - vector_search.py: 8→4 (manage_vs_endpoint, manage_vs_index, query_vs_index, manage_vs_data)
  - genie.py: 5→2 (manage_genie, ask_genie)
  - serving.py: 3→1 (manage_serving_endpoint)
  - apps.py: 3→1 (manage_app)
  - file.py: 2→1 (manage_workspace_files)
  - sql.py: 6→5 (manage_warehouse replaces list/get_best)
  - lakebase.py: 8→4 (manage_lakebase_database, manage_lakebase_branch, manage_lakebase_sync, generate_lakebase_credential)

  Key patterns:
  - All consolidated tools use an action parameter
  - Each action has required params documented in docstring
  - Error messages specify which params are required
  - Hot paths (query_vs_index, ask_genie) kept separate for clarity
  - All skills updated with action tables and examples

* Add integration test infrastructure and fix tool bugs

  Test infrastructure:
  - Add comprehensive integration tests for all MCP tools
  - Add test runner script with parallel execution support
  - Add fixtures for workspace, catalog, and resource cleanup
  - Add test resources (PDFs, SQL files, app configs)

  Bug fixes in databricks-tools-core:
  - Fix workspace file upload for directories
  - Fix job notebook path handling
  - Fix vector search index operations
  - Fix apps API responses
  - Fix dashboard widget handling
  - Fix agent bricks manager listing

  Bug fixes in MCP server tools:
  - Add quota skip handling for apps test
  - Fix genie space operations
  - Fix lakebase database operations
  - Fix compute cluster lifecycle handling
  - Fix dashboard operations

* Change manage_dashboard to use file path instead of inline JSON

  - Replace serialized_dashboard param with dashboard_file_path
  - Tool reads JSON from local file for easier iterative development
  - Update SKILL.md with new workflow documentation

* Update dashboard tests for file-based approach

  - Change simple_dashboard_json fixture to simple_dashboard_file
  - Update all manage_dashboard calls to use dashboard_file_path
  - Add tempfile imports and tmp_path usage for update test

* Fix deploy_app to correctly handle SDK Wait[AppDeployment] return type

  The Databricks SDK's w.apps.deploy() returns a Wait[AppDeployment] object, not an AppDeployment directly. The previous code passed the Wait object to _deployment_to_dict(), which caused getattr() to return None for all attributes since the Wait object doesn't have them.

  This fix uses wait_obj.response to get the actual AppDeployment object before converting it to a dictionary.

* Fix deploy_app to correctly handle SDK Wait[AppDeployment] return type

  The Databricks SDK's w.apps.deploy() returns a Wait[AppDeployment] object, not an AppDeployment directly. The previous code passed the Wait object to _deployment_to_dict(), which caused getattr() to return None for all attributes since the Wait object doesn't have them.
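The getattr() failure mode just described can be reproduced with stand-ins for the SDK types. `Wait` and `AppDeployment` below are simplified mock classes, not the real `databricks.sdk` types, and `_deployment_to_dict` is a sketch of the helper named in the message:

```python
from dataclasses import dataclass


@dataclass
class AppDeployment:  # stand-in for the SDK model
    deployment_id: str
    status: str


class Wait:  # stand-in for the SDK's Wait[AppDeployment] wrapper
    def __init__(self, response: AppDeployment):
        self.response = response


def _deployment_to_dict(deployment) -> dict:
    # getattr() with a None default silently swallows missing attributes,
    # which is why passing the Wait wrapper produced an all-None dict.
    return {
        "deployment_id": getattr(deployment, "deployment_id", None),
        "status": getattr(deployment, "status", None),
    }


wait_obj = Wait(AppDeployment(deployment_id="d-1", status="SUCCEEDED"))

buggy = _deployment_to_dict(wait_obj)           # Wait has no such attributes
fixed = _deployment_to_dict(wait_obj.response)  # unwrap first, as in the fix
```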
  This fix uses wait_obj.response to get the actual AppDeployment object before converting it to a dictionary.

* Clarify MCP tool usage in Genie skill documentation

  - Add tools summary table at top of MCP Tools section
  - Change code blocks from python syntax to plain text
  - Add "# MCP Tool: <name>" comments to clarify these are tool calls, not Python code
  - Move Supporting Tools table to main tools table

* Fix typo in aibi_dashboards.py docstring

  Remove garbage characters from widget documentation.

* Fix genie tools to use SDK methods instead of manager

  - Use w.genie.trash_space() in _delete_genie_resource
  - Add _find_space_by_name() using SDK's list_spaces with pagination
  - Use w.genie.update_space() and w.genie.create_space() for space management
  - Use w.genie.get_space() with include_serialized_space in _get_genie_space
  - Fix validation to allow space_id for updates without display_name

* Improve integration test reliability and timeout handling

  - Add per-suite timeout in run_tests.py (10 min default, configurable)
  - Improve apps test with better cleanup and assertions
  - Add skip logic for quota-exceeded scenarios

* Improve Unity Catalog tool docstrings with comprehensive parameter documentation

  Add detailed parameter documentation to all 9 Unity Catalog MCP tools:
  - manage_uc_objects: Document parameters by object_type (catalog/schema/volume/function)
  - manage_uc_grants: Add privilege lists per securable type
  - manage_uc_storage: Detail credential and external_location parameters
  - manage_uc_connections: Document connection_type options and create_foreign_catalog
  - manage_uc_tags: Detail set_tags/unset_tags/query parameters
  - manage_uc_security_policies: Document row filter and column mask parameters
  - manage_uc_monitors: Detail monitor creation and refresh parameters
  - manage_uc_sharing: Document share/recipient/provider resource types
  - manage_metric_views: Detail dimension/measure format and query parameters

* Add CRITICAL validation steps to dashboard tool docstring

  Add clear instructions requiring users to:
  0. Review the databricks-aibi-dashboards skill for widget JSON structure
  1. Call get_table_stats_and_schema() for table schemas
  2. Call execute_sql() to test EVERY query before use

  This prevents widgets from showing errors due to untested queries.

* Add design best practices section and use relative file paths

  - Add Design Best Practices section for default dashboard behaviors
  - Change /tmp paths to ./ for less opinionated examples
  - Update parent_path example to use {user_email} placeholder

* Improve code execution and synthetic data gen skills

  Execution compute skill:
  - Split into 3 clear execution modes: Databricks Connect (default), Serverless Job, Interactive Cluster
  - Add decision matrix for choosing execution mode
  - Add job_extra_params for custom dependencies in serverless jobs
  - Create dedicated reference files for each mode

  Synthetic data gen skill:
  - Emphasize business story: problem → impact → analysis → solution
  - Add guidance to propose compelling stories by default
  - Consolidate reference files (6 → 2)
  - Add critical rules for data coherence and Databricks value
  - Clarify when to read reference files

* Add performance rules for synthetic data generation

  - Add Critical Rule #12: No Python loops or .collect()
  - Add Performance Rules section with anti-pattern table
  - Emphasize Spark parallelism over driver-side iteration

* Clarify parquet path is folder with table name

* Improve AI/BI dashboard skill documentation with comprehensive examples

  - Replace basic NYC taxi examples with complete Sales Analytics dashboard
  - Add critical widget version requirements table to SKILL.md
  - Add data validation guidance to verify dashboards tell intended story
  - Document key patterns: page types, KPI formatting, filter binding, layout grid

* Add skill reading requirement to dashboard MCP tool docstring

  Require agent to read 4-examples.md before creating dashboards, and if unfamiliar, read full skill documentation first. Valid JSON is critical.

* Emphasize Databricks Connect serverless, add setup requirements

  - Rule #10: Use Databricks Connect Serverless by default (avoid execute_code)
  - Setup: Use uv, require Python 3.12 and databricks-connect>=16.4
  - Common issues: Add DatabricksEnv ImportError and Python version fixes

* Fix MCP server crash on request cancellation

  When a client cancels a long-running MCP request, there's a race condition between the cancellation and normal response paths:
  1. Client cancels request → RequestResponder.cancel() sends error response and sets _completed = True
  2. Middleware catches CancelledError and returns a ToolResult
  3. MCP SDK tries to call message.respond(response)
  4. Crash: assert not self._completed fails

  Fix: Re-raise CancelledError instead of returning a result, allowing the MCP SDK's cancellation handler to properly manage the response lifecycle.

  See: modelcontextprotocol/python-sdk#1153

* Add structured_content to error responses for MCP SDK validation

  When tools have an outputSchema (auto-generated from return type like Dict[str, Any]), MCP SDK requires structured_content in all responses. The middleware was returning ToolResult without structured_content for error cases (timeout, exceptions), causing validation errors: "Output validation error: outputSchema defined but no structured output returned"

  Fix: Include structured_content with the same error data in all error responses.
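The re-raise pattern from the cancellation fix above can be sketched as middleware logic. This is an illustrative reduction, not the server's actual code; the function names are hypothetical:

```python
import asyncio


async def call_tool_middleware(run_tool):
    """Sketch of the fix: let CancelledError propagate instead of converting
    it into a normal result (which would make the SDK respond twice)."""
    try:
        return await run_tool()
    except asyncio.CancelledError:
        # Before the fix: return an error ToolResult here -> crash on respond()
        raise  # after the fix: the SDK's cancellation handler owns the response


async def main():
    async def slow_tool():
        await asyncio.sleep(10)
        return "done"

    task = asyncio.create_task(call_tool_middleware(slow_tool))
    await asyncio.sleep(0.01)
    task.cancel()  # simulate the client cancelling the request
    try:
        await task
        return "no-cancel"
    except asyncio.CancelledError:
        return "cancelled"


outcome = asyncio.run(main())
```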
* Migrate KA operations to Python SDK and fix name lookup issues

  - Migrate ka_create, ka_get, ka_sync_sources to use Python SDK
  - Keep ka_update using raw API 2.1 due to SDK FieldMask bug (converts snake_case to camelCase but API expects snake_case)
  - Fix find_by_name to sanitize names (spaces→underscores) before lookup
  - Fix ka_create_or_update to lookup by name when no tile_id provided, preventing ALREADY_EXISTS errors on repeated calls
  - Update MCP tool layer to use new flat response format
  - Map SDK state values (ACTIVE, CREATING, FAILED) to endpoint_status
  - Add integration test for updating existing KA via create_or_update

* Fix knowledge source description requirement and test ordering

  - Provide default description for knowledge sources when not specified (API requires non-empty knowledge_source.description)
  - Move KA update test to after endpoint is ONLINE (update requires ACTIVE state)

* Fix structured_content not populated for tools with return type annotations

  FastMCP auto-generates outputSchema from return type annotations (e.g., -> Dict[str, Any]) but doesn't populate structured_content in ToolResult. MCP SDK validation then fails: "outputSchema defined but no structured output"

  Fix: Intercept successful results and populate structured_content from JSON text content when missing. Only modifies results when:
  1. structured_content is missing
  2. There's exactly one TextContent item
  3. The text is valid JSON that parses to a dict

* fix(mcp): apply async wrapper on all platforms to prevent cancellation crashes

  The asyncio.to_thread() wrapper was only applied on Windows, but it's needed on ALL platforms to enable proper cancellation handling. Without this fix, when a sync tool runs longer than the client timeout:
  1. Client sends cancellation
  2. Sync tool blocks event loop, can't receive CancelledError
  3. Tool eventually returns, but MCP SDK already responded to cancel
  4. AssertionError: "Request already responded to" → server crashes

  This was discovered when uploading 7,375 files triggered a timeout, crashing the MCP server on macOS.

  Extends the fix from PR #411 which added CancelledError handling in middleware - that fix only works when cancellation can propagate, which requires async execution via to_thread().

* Fix: don't set structured_content on error responses

  Setting structured_content causes MCP SDK to validate it against the tool's outputSchema. For error responses, the error dict {"error": True, ...} doesn't match the expected return type (e.g., Union[str, List[Dict]]), causing "Output validation error: 'result' is a required property".

  Fix: Only set structured_content for successful responses, not errors.
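The async-wrapper fix above hinges on a property of `asyncio.to_thread`: the awaiting task stays cancellable even though the sync function blocks a worker thread. A minimal sketch, assuming nothing about the real middleware beyond what the message states:

```python
import asyncio
import time


def sync_tool():
    """Blocking tool body (stand-in for a long file upload)."""
    time.sleep(0.5)
    return "uploaded"


async def main():
    # Running the sync tool via to_thread keeps the event loop free, so the
    # cancellation can actually reach the awaiting task instead of queueing
    # behind a blocked loop.
    task = asyncio.create_task(asyncio.to_thread(sync_tool))
    await asyncio.sleep(0.05)
    task.cancel()  # simulate the client timing out and cancelling
    try:
        await task
        return "completed"
    except asyncio.CancelledError:
        return "cancelled cleanly"


result = asyncio.run(main())
```

Note the worker thread still runs to completion in the background; what the wrapper buys is that the MCP request can be abandoned without crashing the server.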
* Improve dashboard skill structure based on error analysis

  - Add JSON skeleton section to SKILL.md showing required structure
  - Add Genie note clarifying it's not a widget (use genie_space_id param)
  - Move Key Patterns to top of 4-examples.md for discoverability
  - Clarify example is reference only - adapt to user's actual requirements
  - Add structural errors table to 5-troubleshooting.md

  Root cause fixes:
  - queryLines must be array, not "query": "string"
  - Widgets must be inline in layout[].widget, not separate array
  - pageType required on every page

---------

Co-authored-by: Quentin Ambard <quentin.ambard@databricks.com>
Co-authored-by: Claude <noreply@anthropic.com>
Co-authored-by: calreynolds <calrey98@gmail.com>
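Taken together, the two structured_content fixes above amount to one rule: populate it only for successful results whose single text item parses to a JSON dict. A standalone sketch (`ToolResult` and `TextContent` are simplified stand-ins, not the FastMCP/MCP SDK classes):

```python
import json
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class TextContent:  # simplified stand-in for the MCP text content type
    text: str


@dataclass
class ToolResult:  # simplified stand-in for the FastMCP result type
    content: list = field(default_factory=list)
    structured_content: Optional[dict] = None
    is_error: bool = False


def populate_structured_content(result: ToolResult) -> ToolResult:
    """Fill structured_content from JSON text, but only on success paths."""
    if result.is_error or result.structured_content is not None:
        return result  # errors are left alone; existing content is kept
    if len(result.content) != 1 or not isinstance(result.content[0], TextContent):
        return result  # only exactly one text item qualifies
    try:
        parsed = json.loads(result.content[0].text)
    except ValueError:
        return result  # not valid JSON
    if isinstance(parsed, dict):
        result.structured_content = parsed
    return result
```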
1 parent f81ecd4 commit 3c3ee2f

File tree

14 files changed: +635 additions, -1563 deletions


databricks-mcp-server/databricks_mcp_server/tools/compute.py

Lines changed: 4 additions & 0 deletions
@@ -58,6 +58,7 @@ def execute_code(
     destroy_context_on_completion: bool = False,
     workspace_path: str = None,
     run_name: str = None,
+    job_extra_params: Dict[str, Any] = None,
 ) -> Dict[str, Any]:
     """Execute code on Databricks via serverless or cluster compute.
 
@@ -72,6 +73,8 @@ def execute_code(
     file_path: Run local file (.py/.scala/.sql/.r), auto-detects language.
     workspace_path: Save as notebook in workspace (omit for ephemeral).
         .ipynb: Pass raw JSON with serverless, auto-detected.
+    job_extra_params: Extra job params (serverless only). For dependencies:
+        {"environments": [{"environment_key": "env", "spec": {"client": "4", "dependencies": ["pandas", "sklearn"]}}]}
 
     Timeouts: serverless=1800s, cluster=120s, file=600s.
     Returns: {success, output, error, cluster_id, context_id} or {run_id, run_url}."""
@@ -137,6 +140,7 @@ def execute_code(
         run_name=run_name,
         cleanup=workspace_path is None,
         workspace_path=workspace_path,
+        job_extra_params=job_extra_params,
     )
     return result.to_dict()

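Based on the docstring in the diff above, a serverless run that adds pip dependencies would pass a `job_extra_params` dict of this shape. The dict literal follows the documented format; the `execute_code` call shown in the comment is an MCP tool invocation, not plain Python:

```python
# Shape of job_extra_params for pip dependencies on a serverless run,
# as documented in the execute_code docstring above.
job_extra_params = {
    "environments": [
        {
            "environment_key": "env",
            "spec": {"client": "4", "dependencies": ["pandas", "sklearn"]},
        }
    ]
}

# Hypothetical MCP tool call using it (not executable Python):
# execute_code(file_path="./gen_data.py", job_extra_params=job_extra_params)
```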
databricks-skills/databricks-execution-compute/SKILL.md

Lines changed: 41 additions & 153 deletions
@@ -13,182 +13,70 @@ description: >-
 
 # Databricks Execution & Compute
 
-Run code on Databricks and manage compute resources — all through 4 consolidated MCP tools.
+Run code on Databricks. Three execution modes—choose based on workload.
 
-## MCP Tools
-
-### execute_code
-
-Single entry point for all code execution on Databricks.
-
-| Parameter | Type | Default | Description |
-|-----------|------|---------|-------------|
-| `code` | string | None | Code to execute. Required unless `file_path` is set |
-| `file_path` | string | None | Local file to execute (.py, .scala, .sql, .r). Language auto-detected |
-| `compute_type` | string | `"auto"` | `"serverless"`, `"cluster"`, or `"auto"` |
-| `cluster_id` | string | auto-selected | Cluster to run on (cluster compute) |
-| `context_id` | string | None | Reuse execution context for state preservation (cluster compute) |
-| `language` | string | `"python"` | `"python"`, `"scala"`, `"sql"`, or `"r"` |
-| `timeout` | int | varies | Max wait seconds. Defaults: serverless=1800, cluster=120, file=600 |
-| `destroy_context_on_completion` | bool | `false` | Destroy context after execution (cluster compute) |
-| `workspace_path` | string | None | Persist notebook at this workspace path |
-| `run_name` | string | auto-generated | Human-readable run name (serverless only) |
-
-**compute_type resolution ("auto" mode):**
-- Has `cluster_id` or `context_id` → cluster
-- Language is scala/r → cluster
-- Otherwise → serverless
-
-**When to use each compute_type:**
-
-| Scenario | compute_type | Why |
-|----------|-------------|-----|
-| Run Python, no cluster available | `"serverless"` | No cluster needed; serverless spins up automatically |
-| Run local file on a cluster | `"cluster"` + `file_path` | Auto-detects language; supports Python, Scala, SQL, R |
-| Interactive iteration (preserve variables) | `"cluster"` | Keep context alive across calls via `context_id` |
-| SQL queries that need result rows | Use `execute_sql` tool | Works with SQL warehouses; returns data |
-| Batch/ETL Python | `"serverless"` | Dedicated resources, up to 30 min timeout |
-
-### manage_cluster
-
-Create, modify, start, terminate, or delete clusters.
-
-| Parameter | Type | Default | Description |
-|-----------|------|---------|-------------|
-| `action` | string | *(required)* | `"create"`, `"modify"`, `"start"`, `"terminate"`, or `"delete"` |
-| `cluster_id` | string | None | Required for modify, start, terminate, delete |
-| `name` | string | None | Required for create, optional for modify |
-| `num_workers` | int | `1` | Fixed worker count (ignored if autoscale is set) |
-| `spark_version` | string | latest LTS | DBR version key |
-| `node_type_id` | string | auto-picked | Worker node type |
-| `autotermination_minutes` | int | `120` | Minutes of inactivity before auto-stop |
-| `data_security_mode` | string | `"SINGLE_USER"` | Security mode |
-| `spark_conf` | string (JSON) | None | Spark config overrides |
-| `autoscale_min_workers` | int | None | Min workers for autoscaling |
-| `autoscale_max_workers` | int | None | Max workers for autoscaling |
-
-**DESTRUCTIVE:** `"delete"` is permanent and irreversible — always confirm with user.
-**COSTLY:** `"start"` consumes cloud resources (3-8 min startup) — always ask user first.
-
-### manage_sql_warehouse
-
-Create, modify, or delete SQL warehouses.
-
-| Parameter | Type | Default | Description |
-|-----------|------|---------|-------------|
-| `action` | string | *(required)* | `"create"`, `"modify"`, or `"delete"` |
-| `warehouse_id` | string | None | Required for modify and delete |
-| `name` | string | None | Required for create |
-| `size` | string | `"Small"` | T-shirt size (2X-Small through 4X-Large) |
-| `min_num_clusters` | int | `1` | Minimum clusters |
-| `max_num_clusters` | int | `1` | Maximum clusters for scaling |
-| `auto_stop_mins` | int | `120` | Auto-stop after inactivity |
-| `warehouse_type` | string | `"PRO"` | PRO or CLASSIC |
-| `enable_serverless` | bool | `true` | Enable serverless compute |
-
-**DESTRUCTIVE:** `"delete"` is permanent — always confirm with user.
-
-For listing warehouses, use the `manage_warehouse(action="list")` tool (SQL tools).
-
-### list_compute
-
-List and inspect compute resources.
-
-| Parameter | Type | Default | Description |
-|-----------|------|---------|-------------|
-| `resource` | string | `"clusters"` | `"clusters"`, `"node_types"`, or `"spark_versions"` |
-| `cluster_id` | string | None | Get status for a specific cluster (poll after starting) |
-| `auto_select` | bool | `false` | Return the best running cluster (prefers "shared" > "demo") |
-
-## Ephemeral vs Persistent Mode
+## Execution Mode Decision Matrix
 
-All execution supports two modes:
+| Aspect | [Databricks Connect](references/1-databricks-connect.md) | [Serverless Job](references/2-serverless-job.md) | [Interactive Cluster](references/3-interactive-cluster.md) |
+|--------|-------------------|----------------|---------------------|
+| **Use for** | Spark code (ETL, data gen) | Heavy processing (ML) | State across tool calls, Scala/R |
+| **Startup** | Instant | ~25-50s cold start | ~5min if stopped |
+| **State** | Within Python process | None | Via context_id |
+| **Languages** | Python (PySpark) | Python, SQL | Python, Scala, SQL, R |
+| **Dependencies** | `withDependencies()` | CLI with environments spec | Install on cluster |
 
-**Ephemeral (default):** Code is executed and no artifact is saved. Good for testing, exploration.
+### Decision Flow
 
-**Persistent:** Pass `workspace_path` to save as a notebook in Databricks workspace. Good for model training, ETL, project work. Suggest paths like:
-`/Workspace/Users/{username}/{project-name}/`
-
-## When No Cluster Is Available
-
-If cluster execution finds no running cluster:
-1. The error response includes `startable_clusters` and `suggestions`
-2. Ask the user if they want to start a terminated cluster (3-8 min startup)
-3. Or suggest `compute_type="serverless"` for Python (no cluster needed)
-4. Or suggest `execute_sql` for SQL workloads (uses SQL warehouses)
-
-## Limitations
-
-| Limitation | Applies To | Details |
-|-----------|------------|---------|
-| Cold start ~25-50s | Serverless | Serverless compute spin-up time |
-| No interactive state | Serverless | Each invocation is fresh; no variables persist |
-| Python and SQL only | Serverless | No R, Scala, or Java on serverless |
-| SQL SELECT not captured | Serverless | Use `execute_sql` for SELECT queries |
-| Cluster must be running | Cluster | Use manage_cluster start or switch to serverless |
-| print() output unreliable | Serverless | Use `dbutils.notebook.exit()` instead |
-
-## Quick Start Examples
-
-### Run Python on serverless
-
-```python
-execute_code(code="dbutils.notebook.exit('hello from serverless')")
 ```
+Spark-based code? → Databricks Connect (fastest)
+└─ Python 3.12 missing? → Install it + databricks-connect
+└─ Install fails? → Ask user (don't auto-switch modes)
 
-### Run Python on serverless (persistent)
-
-```python
-execute_code(
-    code=training_code,
-    workspace_path="/Workspace/Users/user@company.com/ml-project/train",
-    run_name="model-training-v1"
-)
+Heavy/long-running (ML)? → Serverless Job (independent)
+Need state across calls? → Interactive Cluster (list and ask which one to use)
+Scala/R? → Interactive Cluster (list and ask which one to use)
 ```
 
-### Run a local file on a cluster
 
-```python
-execute_code(file_path="/local/path/to/etl.py", compute_type="cluster")
-```
-
-### Interactive iteration on a cluster
+## How to Run Code
 
-```python
-# First call — creates context
-result = execute_code(code="import pandas as pd\ndf = pd.DataFrame({'a': [1,2,3]})", compute_type="cluster")
-# Follow-up — reuses context
-execute_code(code="print(df.shape)", context_id=result["context_id"], cluster_id=result["cluster_id"])
-```
+**Read the reference file for your chosen mode before proceeding.**
 
-### Create a cluster
+### Databricks Connect (no MCP tool, run locally) → [reference](references/1-databricks-connect.md)
 
-```python
-manage_cluster(action="create", name="my-dev-cluster", num_workers=2)
+```bash
+python my_spark_script.py
 ```
 
-### Create an autoscaling cluster
+### Serverless Job → [reference](references/2-serverless-job.md)
 
 ```python
-manage_cluster(action="create", name="scaling-cluster", autoscale_min_workers=1, autoscale_max_workers=8)
+execute_code(file_path="/path/to/script.py")
 ```
 
-### Start a terminated cluster
+### Interactive Cluster → [reference](references/3-interactive-cluster.md)
 
 ```python
-manage_cluster(action="start", cluster_id="1234-567890-abcdef")
-# Poll until running
-list_compute(resource="clusters", cluster_id="1234-567890-abcdef")
+# Check for running clusters first (or use the one instructed)
+list_compute(resource="clusters")
+# Ask the customer which one to use
+
+# Run code, reuse context_id for follow-up MCP call
+result = execute_code(code="...", compute_type="cluster", cluster_id="...")
+execute_code(code="...", context_id=result["context_id"], cluster_id=result["cluster_id"])
 ```
 
-### Create a SQL warehouse
+## MCP Tools
 
-```python
-manage_sql_warehouse(action="create", name="analytics-wh", size="Medium")
-```
+| Tool | For | Purpose |
+|------|-----|---------|
+| `execute_code` | Serverless, Interactive | Run code remotely |
+| `list_compute` | Interactive | List clusters, check status, auto-select running cluster |
+| `manage_cluster` | Interactive | Create, start, terminate, delete. **COSTLY:** `start` takes 3-8 min—ask user |
+| `manage_sql_warehouse` | SQL | Create, modify, delete SQL warehouses |
 
 ## Related Skills
 
-- **[databricks-jobs](../databricks-jobs/SKILL.md)** — Production job orchestration with scheduling, retries, and multi-task DAGs
-- **[databricks-dbsql](../databricks-dbsql/SKILL.md)** — SQL warehouse capabilities and AI functions
-- **[databricks-python-sdk](../databricks-python-sdk/SKILL.md)** — Direct SDK usage for workspace automation
+- **[databricks-synthetic-data-gen](../databricks-synthetic-data-gen/SKILL.md)** — Data generation using Spark + Faker
+- **[databricks-jobs](../databricks-jobs/SKILL.md)** — Production job orchestration
+- **[databricks-dbsql](../databricks-dbsql/SKILL.md)** — SQL warehouse and AI functions
Lines changed: 72 additions & 0 deletions (new file: references/1-databricks-connect.md)
@@ -0,0 +1,72 @@
# Databricks Connect (Recommended Default)

**Use when:** Running Spark code locally that executes on Databricks serverless compute. This is the fastest, cleanest approach for data generation, ETL, and any Spark workload.

## Why Databricks Connect First?

- **Instant iteration** — Edit the file, re-run immediately
- **Local debugging** — IDE debugger and breakpoints work
- **No cold start** — Session stays warm across executions
- **Clean dependencies** — `withDependencies()` installs packages on remote compute

## Requirements

- **Python 3.12** (databricks-connect >= 16.4 requires it)
- **databricks-connect >= 16.4** package
- **~/.databrickscfg** with serverless config

## Setup

**Python 3.12 required.** If it's not available, install it (with uv or another tool). If the install fails, ask the user—don't auto-switch modes.
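The version requirement above can be checked up front. A minimal sketch (the warning text is mine):

```python
import sys

# databricks-connect >= 16.4 requires Python 3.12 (see Requirements above)
if sys.version_info[:2] != (3, 12):
    print(
        f"Python {sys.version_info.major}.{sys.version_info.minor} detected; "
        "databricks-connect >= 16.4 requires 3.12. Create a dedicated venv."
    )
```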
Use the default profile; if it isn't set up, you can add one to `~/.databrickscfg` (never overwrite an existing config without consent):

```ini
[DEFAULT]
host = https://your-workspace.cloud.databricks.com/
serverless_compute_id = auto
auth_type = databricks-cli
```
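Since `.databrickscfg` is plain INI, the serverless settings above can be sanity-checked with the standard library before connecting. A sketch using an inline sample in place of the real file:

```python
import configparser

# Sketch: verify a .databrickscfg has the serverless settings shown above.
# The sample string stands in for the contents of ~/.databrickscfg.
sample = """\
[DEFAULT]
host = https://your-workspace.cloud.databricks.com/
serverless_compute_id = auto
auth_type = databricks-cli
"""
cfg = configparser.ConfigParser()
cfg.read_string(sample)
serverless_ready = cfg["DEFAULT"].get("serverless_compute_id") == "auto"
```

To check the real file, replace `read_string(sample)` with `cfg.read(os.path.expanduser("~/.databrickscfg"))`.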
## Usage Pattern

```python
from databricks.connect import DatabricksSession, DatabricksEnv

# Declare dependencies installed on serverless compute
# CRITICAL: Include ALL packages used inside UDFs (pandas/numpy are there by default)
env = DatabricksEnv().withDependencies("faker", "holidays")

spark = (
    DatabricksSession.builder
    .profile("my-workspace")  # optional: use a specific profile from ~/.databrickscfg instead of the default
    .withEnvironment(env)
    .serverless(True)
    .getOrCreate()
)

# Spark code now executes on Databricks serverless
df = spark.range(1000)...
df.write.mode('overwrite').saveAsTable("catalog.schema.table")
```
## Common Issues

| Issue | Solution |
|-------|----------|
| `Python 3.12 required` | Create a venv with Python 3.12 |
| `DatabricksEnv not found` | Upgrade to databricks-connect >= 16.4 |
| `serverless_compute_id` error | Add `serverless_compute_id = auto` to ~/.databrickscfg |
| `ModuleNotFoundError` inside UDF | Add the package to `withDependencies()` |
| `PERSIST TABLE not supported` | Don't use `.cache()` or `.persist()` with serverless |
| `broadcast` is used | Don't broadcast small DataFrames with Spark Connect; use a small Python list or a plain join with the small DataFrame instead |

## When NOT to Use

Switch to **[Serverless Job](2-serverless-job.md)** when:
- One-off execution
- Heavy ML training that shouldn't depend on the local machine staying connected
- Non-Spark Python code (pure sklearn, pytorch, etc.)

Switch to **[Interactive Cluster](3-interactive-cluster.md)** when:
- You need state across multiple separate MCP tool calls
- You need Scala or R support
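The switching rules above can be condensed into a tiny chooser. This is a hypothetical helper (names and parameters are mine) that encodes the decision order, with Databricks Connect as the default:

```python
def choose_mode(needs_state=False, needs_scala_or_r=False,
                heavy_ml=False, one_off=False):
    # Interactive cluster wins when state or Scala/R support is needed;
    # serverless job for heavy ML or one-off runs; otherwise default.
    if needs_state or needs_scala_or_r:
        return "interactive-cluster"
    if heavy_ml or one_off:
        return "serverless-job"
    return "databricks-connect"
```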
Lines changed: 76 additions & 0 deletions (new file: references/2-serverless-job.md)
@@ -0,0 +1,76 @@
# Serverless Job Execution

**Use when:** Running intensive Python code remotely (ML training, heavy processing) that doesn't need Spark, or when the code shouldn't depend on the local machine staying connected.

## When to Choose Serverless Job

- ML model training (runs independently of the local machine)
- Heavy non-Spark Python processing
- Code that takes > 5 minutes (the local connection can drop)
- Production/scheduled runs

## Trade-offs

| Pro | Con |
|-----|-----|
| No cluster to manage | ~25-50s cold start each invocation |
| Up to 30 min timeout | No state preserved between calls |
| Independent execution | print() unreliable—use `dbutils.notebook.exit()` |

## Executing Code

### Prefer running from a local file (edit the file, then run it)

```python
execute_code(
    file_path="/local/path/to/train_model.py",
    compute_type="serverless"
)
```

## Jobs with Custom Dependencies

Use `job_extra_params` to install pip packages:

```python
execute_code(
    file_path="/path/to/train.py",
    job_extra_params={
        "environments": [{
            "environment_key": "ml_env",
            "spec": {"client": "4", "dependencies": ["scikit-learn", "pandas", "mlflow"]}
        }]
    }
)
```

**CRITICAL:** Use `"client": "4"` in the spec. `"client": "1"` won't install dependencies.
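Since the environments spec is easy to get subtly wrong, it can help to build it with a small helper that bakes in `"client": "4"`. A sketch (the helper name and `key` default are mine, mirroring the spec shown above):

```python
def ml_environment(*packages, key="ml_env"):
    """Build job_extra_params for serverless dependency installation.

    Hypothetical convenience helper; "client": "4" is hard-coded because
    "1" silently skips dependency installation (see the note above).
    """
    return {
        "environments": [{
            "environment_key": key,
            "spec": {"client": "4", "dependencies": list(packages)},
        }]
    }

params = ml_environment("scikit-learn", "pandas", "mlflow")
```

The result would be passed as `job_extra_params=params` in the `execute_code` call.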
## Output Handling

```python
# ❌ BAD - print() may not be captured
print("Training complete!")

# ✅ GOOD - Use dbutils.notebook.exit()
import json
results = {"accuracy": 0.95, "model_path": "/Volumes/..."}
dbutils.notebook.exit(json.dumps(results))
```
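The exit payload travels back to the caller as a single string, which is why the pattern above serializes to JSON rather than printing fields. A minimal caller-side sketch of the round trip (values are illustrative):

```python
import json

# What the job sends via dbutils.notebook.exit(...)
results = {"accuracy": 0.95, "model_path": "/Volumes/..."}
payload = json.dumps(results)

# Caller side: recover structured results from the exit string
parsed = json.loads(payload)
```

Keeping the payload to JSON-serializable types (no numpy scalars, no datetimes) avoids `json.dumps` failures at the very end of a long run.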
## Common Issues

| Issue | Solution |
|-------|----------|
| print() output missing | Use `dbutils.notebook.exit()` |
| `ModuleNotFoundError` | Add the package to the environments spec with `"client": "4"` |
| Job times out | Max is 1800s (30 min); split into smaller tasks |

## When NOT to Use

Switch to **[Databricks Connect](1-databricks-connect.md)** when:
- Iterating on Spark code and you want instant feedback
- You need local debugging with breakpoints

Switch to **[Interactive Cluster](3-interactive-cluster.md)** when:
- You need state across multiple MCP tool calls
- You need Scala or R support
