
Commit 3c3ee2f

Quentin Ambard, Claude, and calreynolds authored
Improve code execution and synthetic data generation skills (#403)
* Enhance AI/BI dashboard skill with comprehensive widget specs

  Added missing documentation from production dashboard generation:

  1-widget-specifications.md:
  - Combo charts (bar + line on same widget) with version 1
  - Counter number formatting (currency, percent, plain number)
  - Widget name max length (60 characters)
  - Color scale restrictions (no scheme/colorRamp/mappings)
  - Quantitative color encoding for gradient effects
  - Bar chart group vs stacked decision criteria with examples

  2-filters.md:
  - Date range picker complete example
  - Multi-dataset filter binding (one query per dataset)
  - Global filter performance note (auto WHERE clause)

  SKILL.md:
  - ORDER BY guidance for time series and rankings

  🤖 Generated with [Claude Code](https://claude.com/claude-code)
  Co-Authored-By: Claude <noreply@anthropic.com>

* Add TOP-N + Other bucketing guidance for high-cardinality dimensions

  When a dimension has too many values (50+ stores, products, etc.), charts become unreadable. Added guidance to:
  - Check cardinality via get_table_details before charting
  - Use TOP-N + "Other" SQL pattern to bucket low-value items
  - Aggregate to higher abstraction level as alternative
  - Use table widgets for high-cardinality data

* Remove async deploy_dashboard function for consistency

  The codebase doesn't use async anywhere else, so remove the unused async version of deploy_dashboard and keep only the synchronous one.
  - Remove asyncio import
  - Remove async deploy_dashboard function (was using asyncio.to_thread)
  - Rename deploy_dashboard_sync to deploy_dashboard
  - Update exports in __init__.py

* Add genie_space_id parameter to dashboard creation

  Allow linking a Genie space to a dashboard by passing genie_space_id. This enables the "Ask Genie" button on the dashboard UI.
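The Genie-space linking just described could be sketched as the following injection helper. The `uiSettings.genieSpace` shape (`isEnabled`, `enablementMode`) comes from the commit message; the function name and the `id` field are illustrative assumptions, not the actual implementation.

```python
import json


def link_genie_space(serialized_dashboard: str, genie_space_id: str) -> str:
    """Hypothetical helper: inject a Genie space config into a dashboard JSON."""
    dashboard = json.loads(serialized_dashboard)
    # Shape per the commit description: uiSettings.genieSpace with
    # isEnabled=true and enablementMode=ENABLED. The "id" key is an assumption.
    ui_settings = dashboard.setdefault("uiSettings", {})
    ui_settings["genieSpace"] = {
        "id": genie_space_id,
        "isEnabled": True,
        "enablementMode": "ENABLED",
    }
    return json.dumps(dashboard)
```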
  The Genie space config is injected into the serialized_dashboard JSON under uiSettings.genieSpace with isEnabled=true and enablementMode=ENABLED.

* Add catalog and schema parameters to dashboard creation

  Allow setting default catalog and schema for dashboard datasets via the dataset_catalog and dataset_schema API parameters. These defaults apply to unqualified table names in SQL queries.

* Add comprehensive date range filtering documentation

  - Document field-based filtering (automatic IN_RANGE on date fields)
  - Document parameter-based filtering (:date_range.min/max in SQL)
  - Show how to combine both approaches in one filter
  - Add guidance on when NOT to apply date filtering (MRR, all-time totals)
  - Update SKILL.md tools table with new genie_space_id, catalog, schema params

* Restructure AI/BI dashboard skill with improved organization

  - Split widget specs into basic (1-widget-specifications.md) and advanced (2-advanced-widget-specifications.md) files
  - Add area chart, scatter plot, combo chart, and choropleth map documentation
  - Rename files for consistent numbering (3-filters, 4-examples, 5-troubleshooting)
  - Remove duplicate information across files (versions, naming rules, etc.)
  - Add widget display formatting guidance (currency, percentage, displayName)
  - Simplify SKILL.md quality checklist with link to version table
  - Shorten verbose examples while preserving all critical information
  - Clarify query naming convention for charts vs filters

* Add back critical behavioral instructions for text widgets and filters

* Restore important behavioral instructions removed during restructure

  - Counter: full Pattern 2 example with CRITICAL field name matching note
  - Table: disaggregated:true guidance and bold emphasis
  - Line/Bar: x,y,color encodings and disaggregated guidance
  - Pie: 3-8 category limit for readability

* Restore detailed guidance that was removed during restructure

  - 5-troubleshooting.md: Restore full troubleshooting content with version guidance, filter debugging, and detailed error explanations
  - SKILL.md: Restore full 10-item quality checklist
  - SKILL.md: Restore standard dashboard structure example
  - SKILL.md: Restore cardinality guidance table (with softer 'suggested' language)

* Optimize MCP tool docstrings for token efficiency

  - Reduce docstring verbosity across all 18 tool files (~89% reduction)
  - Keep all functional information while being concise
  - Add skill references to complex tools (dashboards, vector search, genie, jobs, pipelines, lakebase, unity catalog, serving, apps, agent bricks)
  - Maintain human readability with bullet points and structure
  - Preserve critical warnings (ASK USER FIRST, CONFIRM WITH USER)
  - Keep return format hints for AI parsing

  Net reduction: 1,843 lines across 18 files

* Add parameter context for ambiguous docstring params

  - agent_bricks.py: Add context for description, instructions, volume_path, examples
  - genie.py: Add context for table_identifiers, description, sample_questions, serialized_space
  - jobs.py: Add context for tasks, job_clusters, environments, schedule, git_source
  - lakebase.py: Add context for source_branch, ttl_seconds, is_protected, autoscaling params, and sync source/target table names
  - pipelines.py: Add context for root_path, workspace_file_paths, extra_settings, full_refresh

* Consolidate MCP tools from 77 to 44 (43% reduction)

  Tool consolidations:
  - pipelines.py: 10→2 (manage_pipeline, manage_pipeline_run)
  - volume_files.py: 6→1 (manage_volume_files)
  - aibi_dashboards.py: 4→1 (manage_dashboard)
  - vector_search.py: 8→4 (manage_vs_endpoint, manage_vs_index, query_vs_index, manage_vs_data)
  - genie.py: 5→2 (manage_genie, ask_genie)
  - serving.py: 3→1 (manage_serving_endpoint)
  - apps.py: 3→1 (manage_app)
  - file.py: 2→1 (manage_workspace_files)
  - sql.py: 6→5 (manage_warehouse replaces list/get_best)
  - lakebase.py: 8→4 (manage_lakebase_database, manage_lakebase_branch, manage_lakebase_sync, generate_lakebase_credential)

  Key patterns:
  - All consolidated tools use an action parameter
  - Each action has required params documented in docstring
  - Error messages specify which params are required
  - Hot paths (query_vs_index, ask_genie) kept separate for clarity
  - All skills updated with action tables and examples

* Add integration test infrastructure and fix tool bugs

  Test infrastructure:
  - Add comprehensive integration tests for all MCP tools
  - Add test runner script with parallel execution support
  - Add fixtures for workspace, catalog, and resource cleanup
  - Add test resources (PDFs, SQL files, app configs)

  Bug fixes in databricks-tools-core:
  - Fix workspace file upload for directories
  - Fix job notebook path handling
  - Fix vector search index operations
  - Fix apps API responses
  - Fix dashboard widget handling
  - Fix agent bricks manager listing

  Bug fixes in MCP server tools:
  - Add quota skip handling for apps test
  - Fix genie space operations
  - Fix lakebase database operations
  - Fix compute cluster lifecycle handling
  - Fix dashboard operations

* Change manage_dashboard to use file path instead of inline JSON

  - Replace serialized_dashboard param with dashboard_file_path
  - Tool reads JSON from local file for easier iterative development
  - Update SKILL.md with new workflow documentation

* Update dashboard tests for file-based approach

  - Change simple_dashboard_json fixture to simple_dashboard_file
  - Update all manage_dashboard calls to use dashboard_file_path
  - Add tempfile imports and tmp_path usage for update test

* Fix deploy_app to correctly handle SDK Wait[AppDeployment] return type

  The Databricks SDK's w.apps.deploy() returns a Wait[AppDeployment] object, not an AppDeployment directly. The previous code passed the Wait object to _deployment_to_dict(), which caused getattr() to return None for all attributes since the Wait object doesn't have them.

  This fix uses wait_obj.response to get the actual AppDeployment object before converting it to a dictionary.

* Fix deploy_app to correctly handle SDK Wait[AppDeployment] return type

  The Databricks SDK's w.apps.deploy() returns a Wait[AppDeployment] object, not an AppDeployment directly. The previous code passed the Wait object to _deployment_to_dict(), which caused getattr() to return None for all attributes since the Wait object doesn't have them.
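The getattr() failure mode just described can be reproduced with stand-ins for the SDK types. `Wait` and `AppDeployment` below are simplified mock classes, not the real `databricks.sdk` types, and `_deployment_to_dict` is a sketch of the helper named in the message:

```python
from dataclasses import dataclass


@dataclass
class AppDeployment:  # stand-in for the SDK model
    deployment_id: str
    status: str


class Wait:  # stand-in for the SDK's Wait[AppDeployment] wrapper
    def __init__(self, response: AppDeployment):
        self.response = response


def _deployment_to_dict(deployment) -> dict:
    # getattr() with a None default silently swallows missing attributes,
    # which is why passing the Wait wrapper produced an all-None dict.
    return {
        "deployment_id": getattr(deployment, "deployment_id", None),
        "status": getattr(deployment, "status", None),
    }


wait_obj = Wait(AppDeployment(deployment_id="d-1", status="SUCCEEDED"))

buggy = _deployment_to_dict(wait_obj)           # Wait has no such attributes
fixed = _deployment_to_dict(wait_obj.response)  # unwrap first, as in the fix
```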
  This fix uses wait_obj.response to get the actual AppDeployment object before converting it to a dictionary.

* Clarify MCP tool usage in Genie skill documentation

  - Add tools summary table at top of MCP Tools section
  - Change code blocks from python syntax to plain text
  - Add "# MCP Tool: <name>" comments to clarify these are tool calls, not Python code
  - Move Supporting Tools table to main tools table

* Fix typo in aibi_dashboards.py docstring

  Remove garbage characters from widget documentation.

* Fix genie tools to use SDK methods instead of manager

  - Use w.genie.trash_space() in _delete_genie_resource
  - Add _find_space_by_name() using SDK's list_spaces with pagination
  - Use w.genie.update_space() and w.genie.create_space() for space management
  - Use w.genie.get_space() with include_serialized_space in _get_genie_space
  - Fix validation to allow space_id for updates without display_name

* Improve integration test reliability and timeout handling

  - Add per-suite timeout in run_tests.py (10 min default, configurable)
  - Improve apps test with better cleanup and assertions
  - Add skip logic for quota-exceeded scenarios

* Improve Unity Catalog tool docstrings with comprehensive parameter documentation

  Add detailed parameter documentation to all 9 Unity Catalog MCP tools:
  - manage_uc_objects: Document parameters by object_type (catalog/schema/volume/function)
  - manage_uc_grants: Add privilege lists per securable type
  - manage_uc_storage: Detail credential and external_location parameters
  - manage_uc_connections: Document connection_type options and create_foreign_catalog
  - manage_uc_tags: Detail set_tags/unset_tags/query parameters
  - manage_uc_security_policies: Document row filter and column mask parameters
  - manage_uc_monitors: Detail monitor creation and refresh parameters
  - manage_uc_sharing: Document share/recipient/provider resource types
  - manage_metric_views: Detail dimension/measure format and query parameters

* Add CRITICAL validation steps to dashboard tool docstring

  Add clear instructions requiring users to:
  0. Review the databricks-aibi-dashboards skill for widget JSON structure
  1. Call get_table_stats_and_schema() for table schemas
  2. Call execute_sql() to test EVERY query before use

  This prevents widgets from showing errors due to untested queries.

* Add design best practices section and use relative file paths

  - Add Design Best Practices section for default dashboard behaviors
  - Change /tmp paths to ./ for less opinionated examples
  - Update parent_path example to use {user_email} placeholder

* Improve code execution and synthetic data gen skills

  Execution compute skill:
  - Split into 3 clear execution modes: Databricks Connect (default), Serverless Job, Interactive Cluster
  - Add decision matrix for choosing execution mode
  - Add job_extra_params for custom dependencies in serverless jobs
  - Create dedicated reference files for each mode

  Synthetic data gen skill:
  - Emphasize business story: problem → impact → analysis → solution
  - Add guidance to propose compelling stories by default
  - Consolidate reference files (6 → 2)
  - Add critical rules for data coherence and Databricks value
  - Clarify when to read reference files

* Add performance rules for synthetic data generation

  - Add Critical Rule #12: No Python loops or .collect()
  - Add Performance Rules section with anti-pattern table
  - Emphasize Spark parallelism over driver-side iteration

* Clarify parquet path is folder with table name

* Improve AI/BI dashboard skill documentation with comprehensive examples

  - Replace basic NYC taxi examples with complete Sales Analytics dashboard
  - Add critical widget version requirements table to SKILL.md
  - Add data validation guidance to verify dashboards tell intended story
  - Document key patterns: page types, KPI formatting, filter binding, layout grid

* Add skill reading requirement to dashboard MCP tool docstring

  Require agent to read 4-examples.md before creating dashboards, and if unfamiliar, read full skill documentation first. Valid JSON is critical.

* Emphasize Databricks Connect serverless, add setup requirements

  - Rule #10: Use Databricks Connect Serverless by default (avoid execute_code)
  - Setup: Use uv, require Python 3.12 and databricks-connect>=16.4
  - Common issues: Add DatabricksEnv ImportError and Python version fixes

* Fix MCP server crash on request cancellation

  When a client cancels a long-running MCP request, there's a race condition between the cancellation and normal response paths:
  1. Client cancels request → RequestResponder.cancel() sends error response and sets _completed = True
  2. Middleware catches CancelledError and returns a ToolResult
  3. MCP SDK tries to call message.respond(response)
  4. Crash: assert not self._completed fails

  Fix: Re-raise CancelledError instead of returning a result, allowing the MCP SDK's cancellation handler to properly manage the response lifecycle.

  See: modelcontextprotocol/python-sdk#1153

* Add structured_content to error responses for MCP SDK validation

  When tools have an outputSchema (auto-generated from return type like Dict[str, Any]), MCP SDK requires structured_content in all responses. The middleware was returning ToolResult without structured_content for error cases (timeout, exceptions), causing validation errors: "Output validation error: outputSchema defined but no structured output returned"

  Fix: Include structured_content with the same error data in all error responses.
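The re-raise pattern from the cancellation fix above can be sketched as middleware logic. This is an illustrative reduction, not the server's actual code; the function names are hypothetical:

```python
import asyncio


async def call_tool_middleware(run_tool):
    """Sketch of the fix: let CancelledError propagate instead of converting
    it into a normal result (which would make the SDK respond twice)."""
    try:
        return await run_tool()
    except asyncio.CancelledError:
        # Before the fix: return an error ToolResult here -> crash on respond()
        raise  # after the fix: the SDK's cancellation handler owns the response


async def main():
    async def slow_tool():
        await asyncio.sleep(10)
        return "done"

    task = asyncio.create_task(call_tool_middleware(slow_tool))
    await asyncio.sleep(0.01)
    task.cancel()  # simulate the client cancelling the request
    try:
        await task
        return "no-cancel"
    except asyncio.CancelledError:
        return "cancelled"


outcome = asyncio.run(main())
```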
* Migrate KA operations to Python SDK and fix name lookup issues

  - Migrate ka_create, ka_get, ka_sync_sources to use Python SDK
  - Keep ka_update using raw API 2.1 due to SDK FieldMask bug (converts snake_case to camelCase but API expects snake_case)
  - Fix find_by_name to sanitize names (spaces→underscores) before lookup
  - Fix ka_create_or_update to lookup by name when no tile_id provided, preventing ALREADY_EXISTS errors on repeated calls
  - Update MCP tool layer to use new flat response format
  - Map SDK state values (ACTIVE, CREATING, FAILED) to endpoint_status
  - Add integration test for updating existing KA via create_or_update

* Fix knowledge source description requirement and test ordering

  - Provide default description for knowledge sources when not specified (API requires non-empty knowledge_source.description)
  - Move KA update test to after endpoint is ONLINE (update requires ACTIVE state)

* Fix structured_content not populated for tools with return type annotations

  FastMCP auto-generates outputSchema from return type annotations (e.g., -> Dict[str, Any]) but doesn't populate structured_content in ToolResult. MCP SDK validation then fails: "outputSchema defined but no structured output"

  Fix: Intercept successful results and populate structured_content from JSON text content when missing. Only modifies results when:
  1. structured_content is missing
  2. There's exactly one TextContent item
  3. The text is valid JSON that parses to a dict

* fix(mcp): apply async wrapper on all platforms to prevent cancellation crashes

  The asyncio.to_thread() wrapper was only applied on Windows, but it's needed on ALL platforms to enable proper cancellation handling. Without this fix, when a sync tool runs longer than the client timeout:
  1. Client sends cancellation
  2. Sync tool blocks event loop, can't receive CancelledError
  3. Tool eventually returns, but MCP SDK already responded to cancel
  4. AssertionError: "Request already responded to" → server crashes

  This was discovered when uploading 7,375 files triggered a timeout, crashing the MCP server on macOS.

  Extends the fix from PR #411 which added CancelledError handling in middleware - that fix only works when cancellation can propagate, which requires async execution via to_thread().

* Fix: don't set structured_content on error responses

  Setting structured_content causes MCP SDK to validate it against the tool's outputSchema. For error responses, the error dict {"error": True, ...} doesn't match the expected return type (e.g., Union[str, List[Dict]]), causing "Output validation error: 'result' is a required property".

  Fix: Only set structured_content for successful responses, not errors.
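The async-wrapper fix above hinges on a property of `asyncio.to_thread`: the awaiting task stays cancellable even though the sync function blocks a worker thread. A minimal sketch, assuming nothing about the real middleware beyond what the message states:

```python
import asyncio
import time


def sync_tool():
    """Blocking tool body (stand-in for a long file upload)."""
    time.sleep(0.5)
    return "uploaded"


async def main():
    # Running the sync tool via to_thread keeps the event loop free, so the
    # cancellation can actually reach the awaiting task instead of queueing
    # behind a blocked loop.
    task = asyncio.create_task(asyncio.to_thread(sync_tool))
    await asyncio.sleep(0.05)
    task.cancel()  # simulate the client timing out and cancelling
    try:
        await task
        return "completed"
    except asyncio.CancelledError:
        return "cancelled cleanly"


result = asyncio.run(main())
```

Note the worker thread still runs to completion in the background; what the wrapper buys is that the MCP request can be abandoned without crashing the server.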
* Improve dashboard skill structure based on error analysis

  - Add JSON skeleton section to SKILL.md showing required structure
  - Add Genie note clarifying it's not a widget (use genie_space_id param)
  - Move Key Patterns to top of 4-examples.md for discoverability
  - Clarify example is reference only - adapt to user's actual requirements
  - Add structural errors table to 5-troubleshooting.md

  Root cause fixes:
  - queryLines must be array, not "query": "string"
  - Widgets must be inline in layout[].widget, not separate array
  - pageType required on every page

---------

Co-authored-by: Quentin Ambard <quentin.ambard@databricks.com>
Co-authored-by: Claude <noreply@anthropic.com>
Co-authored-by: calreynolds <calrey98@gmail.com>
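Taken together, the two structured_content fixes above amount to one rule: populate it only for successful results whose single text item parses to a JSON dict. A standalone sketch (`ToolResult` and `TextContent` are simplified stand-ins, not the FastMCP/MCP SDK classes):

```python
import json
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class TextContent:  # simplified stand-in for the MCP text content type
    text: str


@dataclass
class ToolResult:  # simplified stand-in for the FastMCP result type
    content: list = field(default_factory=list)
    structured_content: Optional[dict] = None
    is_error: bool = False


def populate_structured_content(result: ToolResult) -> ToolResult:
    """Fill structured_content from JSON text, but only on success paths."""
    if result.is_error or result.structured_content is not None:
        return result  # errors are left alone; existing content is kept
    if len(result.content) != 1 or not isinstance(result.content[0], TextContent):
        return result  # only exactly one text item qualifies
    try:
        parsed = json.loads(result.content[0].text)
    except ValueError:
        return result  # not valid JSON
    if isinstance(parsed, dict):
        result.structured_content = parsed
    return result
```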
1 parent f81ecd4 commit 3c3ee2f

File tree

14 files changed: +635 additions, -1563 deletions


databricks-mcp-server/databricks_mcp_server/tools/compute.py

Lines changed: 4 additions & 0 deletions
@@ -58,6 +58,7 @@ def execute_code(
     destroy_context_on_completion: bool = False,
     workspace_path: str = None,
     run_name: str = None,
+    job_extra_params: Dict[str, Any] = None,
 ) -> Dict[str, Any]:
     """Execute code on Databricks via serverless or cluster compute.
 
@@ -72,6 +73,8 @@ def execute_code(
     file_path: Run local file (.py/.scala/.sql/.r), auto-detects language.
     workspace_path: Save as notebook in workspace (omit for ephemeral).
         .ipynb: Pass raw JSON with serverless, auto-detected.
+    job_extra_params: Extra job params (serverless only). For dependencies:
+        {"environments": [{"environment_key": "env", "spec": {"client": "4", "dependencies": ["pandas", "sklearn"]}}]}
 
     Timeouts: serverless=1800s, cluster=120s, file=600s.
     Returns: {success, output, error, cluster_id, context_id} or {run_id, run_url}."""
@@ -137,6 +140,7 @@ def execute_code(
         run_name=run_name,
         cleanup=workspace_path is None,
         workspace_path=workspace_path,
+        job_extra_params=job_extra_params,
     )
     return result.to_dict()

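Based on the docstring in the diff above, a serverless run that adds pip dependencies would pass a `job_extra_params` dict of this shape. The dict literal follows the documented format; the `execute_code` call shown in the comment is an MCP tool invocation, not plain Python:

```python
# Shape of job_extra_params for pip dependencies on a serverless run,
# as documented in the execute_code docstring above.
job_extra_params = {
    "environments": [
        {
            "environment_key": "env",
            "spec": {"client": "4", "dependencies": ["pandas", "sklearn"]},
        }
    ]
}

# Hypothetical MCP tool call using it (not executable Python):
# execute_code(file_path="./gen_data.py", job_extra_params=job_extra_params)
```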
databricks-skills/databricks-execution-compute/SKILL.md

Lines changed: 41 additions & 153 deletions
@@ -13,182 +13,70 @@ description: >-
 
 # Databricks Execution & Compute
 
-Run code on Databricks and manage compute resources — all through 4 consolidated MCP tools.
+Run code on Databricks. Three execution modes—choose based on workload.
 
-## MCP Tools
-
-### execute_code
-
-Single entry point for all code execution on Databricks.
-
-| Parameter | Type | Default | Description |
-|-----------|------|---------|-------------|
-| `code` | string | None | Code to execute. Required unless `file_path` is set |
-| `file_path` | string | None | Local file to execute (.py, .scala, .sql, .r). Language auto-detected |
-| `compute_type` | string | `"auto"` | `"serverless"`, `"cluster"`, or `"auto"` |
-| `cluster_id` | string | auto-selected | Cluster to run on (cluster compute) |
-| `context_id` | string | None | Reuse execution context for state preservation (cluster compute) |
-| `language` | string | `"python"` | `"python"`, `"scala"`, `"sql"`, or `"r"` |
-| `timeout` | int | varies | Max wait seconds. Defaults: serverless=1800, cluster=120, file=600 |
-| `destroy_context_on_completion` | bool | `false` | Destroy context after execution (cluster compute) |
-| `workspace_path` | string | None | Persist notebook at this workspace path |
-| `run_name` | string | auto-generated | Human-readable run name (serverless only) |
-
-**compute_type resolution ("auto" mode):**
-- Has `cluster_id` or `context_id` → cluster
-- Language is scala/r → cluster
-- Otherwise → serverless
-
-**When to use each compute_type:**
-
-| Scenario | compute_type | Why |
-|----------|-------------|-----|
-| Run Python, no cluster available | `"serverless"` | No cluster needed; serverless spins up automatically |
-| Run local file on a cluster | `"cluster"` + `file_path` | Auto-detects language; supports Python, Scala, SQL, R |
-| Interactive iteration (preserve variables) | `"cluster"` | Keep context alive across calls via `context_id` |
-| SQL queries that need result rows | Use `execute_sql` tool | Works with SQL warehouses; returns data |
-| Batch/ETL Python | `"serverless"` | Dedicated resources, up to 30 min timeout |
-
-### manage_cluster
-
-Create, modify, start, terminate, or delete clusters.
-
-| Parameter | Type | Default | Description |
-|-----------|------|---------|-------------|
-| `action` | string | *(required)* | `"create"`, `"modify"`, `"start"`, `"terminate"`, or `"delete"` |
-| `cluster_id` | string | None | Required for modify, start, terminate, delete |
-| `name` | string | None | Required for create, optional for modify |
-| `num_workers` | int | `1` | Fixed worker count (ignored if autoscale is set) |
-| `spark_version` | string | latest LTS | DBR version key |
-| `node_type_id` | string | auto-picked | Worker node type |
-| `autotermination_minutes` | int | `120` | Minutes of inactivity before auto-stop |
-| `data_security_mode` | string | `"SINGLE_USER"` | Security mode |
-| `spark_conf` | string (JSON) | None | Spark config overrides |
-| `autoscale_min_workers` | int | None | Min workers for autoscaling |
-| `autoscale_max_workers` | int | None | Max workers for autoscaling |
-
-**DESTRUCTIVE:** `"delete"` is permanent and irreversible — always confirm with user.
-**COSTLY:** `"start"` consumes cloud resources (3-8 min startup) — always ask user first.
-
-### manage_sql_warehouse
-
-Create, modify, or delete SQL warehouses.
-
-| Parameter | Type | Default | Description |
-|-----------|------|---------|-------------|
-| `action` | string | *(required)* | `"create"`, `"modify"`, or `"delete"` |
-| `warehouse_id` | string | None | Required for modify and delete |
-| `name` | string | None | Required for create |
-| `size` | string | `"Small"` | T-shirt size (2X-Small through 4X-Large) |
-| `min_num_clusters` | int | `1` | Minimum clusters |
-| `max_num_clusters` | int | `1` | Maximum clusters for scaling |
-| `auto_stop_mins` | int | `120` | Auto-stop after inactivity |
-| `warehouse_type` | string | `"PRO"` | PRO or CLASSIC |
-| `enable_serverless` | bool | `true` | Enable serverless compute |
-
-**DESTRUCTIVE:** `"delete"` is permanent — always confirm with user.
-
-For listing warehouses, use the `manage_warehouse(action="list")` tool (SQL tools).
-
-### list_compute
-
-List and inspect compute resources.
-
-| Parameter | Type | Default | Description |
-|-----------|------|---------|-------------|
-| `resource` | string | `"clusters"` | `"clusters"`, `"node_types"`, or `"spark_versions"` |
-| `cluster_id` | string | None | Get status for a specific cluster (poll after starting) |
-| `auto_select` | bool | `false` | Return the best running cluster (prefers "shared" > "demo") |
-
-## Ephemeral vs Persistent Mode
+## Execution Mode Decision Matrix
 
-All execution supports two modes:
+| Aspect | [Databricks Connect](references/1-databricks-connect.md) | [Serverless Job](references/2-serverless-job.md) | [Interactive Cluster](references/3-interactive-cluster.md) |
+|--------|-------------------|----------------|---------------------|
+| **Use for** | Spark code (ETL, data gen) | Heavy processing (ML) | State across tool calls, Scala/R |
+| **Startup** | Instant | ~25-50s cold start | ~5min if stopped |
+| **State** | Within Python process | None | Via context_id |
+| **Languages** | Python (PySpark) | Python, SQL | Python, Scala, SQL, R |
+| **Dependencies** | `withDependencies()` | CLI with environments spec | Install on cluster |
 
-**Ephemeral (default):** Code is executed and no artifact is saved. Good for testing, exploration.
+### Decision Flow
 
-**Persistent:** Pass `workspace_path` to save as a notebook in Databricks workspace. Good for model training, ETL, project work. Suggest paths like:
-`/Workspace/Users/{username}/{project-name}/`
-
-## When No Cluster Is Available
-
-If cluster execution finds no running cluster:
-1. The error response includes `startable_clusters` and `suggestions`
-2. Ask the user if they want to start a terminated cluster (3-8 min startup)
-3. Or suggest `compute_type="serverless"` for Python (no cluster needed)
-4. Or suggest `execute_sql` for SQL workloads (uses SQL warehouses)
-
-## Limitations
-
-| Limitation | Applies To | Details |
-|-----------|------------|---------|
-| Cold start ~25-50s | Serverless | Serverless compute spin-up time |
-| No interactive state | Serverless | Each invocation is fresh; no variables persist |
-| Python and SQL only | Serverless | No R, Scala, or Java on serverless |
-| SQL SELECT not captured | Serverless | Use `execute_sql` for SELECT queries |
-| Cluster must be running | Cluster | Use manage_cluster start or switch to serverless |
-| print() output unreliable | Serverless | Use `dbutils.notebook.exit()` instead |
-
-## Quick Start Examples
-
-### Run Python on serverless
-
-```python
-execute_code(code="dbutils.notebook.exit('hello from serverless')")
 ```
+Spark-based code? → Databricks Connect (fastest)
+└─ Python 3.12 missing? → Install it + databricks-connect
+└─ Install fails? → Ask user (don't auto-switch modes)
 
-### Run Python on serverless (persistent)
-
-```python
-execute_code(
-    code=training_code,
-    workspace_path="/Workspace/Users/user@company.com/ml-project/train",
-    run_name="model-training-v1"
-)
+Heavy/long-running (ML)? → Serverless Job (independent)
+Need state across calls? → Interactive Cluster (list and ask which one to use)
+Scala/R? → Interactive Cluster (list and ask which one to use)
 ```
 
-### Run a local file on a cluster
 
-```python
-execute_code(file_path="/local/path/to/etl.py", compute_type="cluster")
-```
-
-### Interactive iteration on a cluster
+## How to Run Code
 
-```python
-# First call — creates context
-result = execute_code(code="import pandas as pd\ndf = pd.DataFrame({'a': [1,2,3]})", compute_type="cluster")
-# Follow-up — reuses context
-execute_code(code="print(df.shape)", context_id=result["context_id"], cluster_id=result["cluster_id"])
-```
+**Read the reference file for your chosen mode before proceeding.**
 
-### Create a cluster
+### Databricks Connect (no MCP tool, run locally) → [reference](references/1-databricks-connect.md)
 
-```python
-manage_cluster(action="create", name="my-dev-cluster", num_workers=2)
+```bash
+python my_spark_script.py
 ```
 
-### Create an autoscaling cluster
+### Serverless Job → [reference](references/2-serverless-job.md)
 
 ```python
-manage_cluster(action="create", name="scaling-cluster", autoscale_min_workers=1, autoscale_max_workers=8)
+execute_code(file_path="/path/to/script.py")
 ```
 
-### Start a terminated cluster
+### Interactive Cluster → [reference](references/3-interactive-cluster.md)
 
 ```python
-manage_cluster(action="start", cluster_id="1234-567890-abcdef")
-# Poll until running
-list_compute(resource="clusters", cluster_id="1234-567890-abcdef")
+# Check for running clusters first (or use the one instructed)
+list_compute(resource="clusters")
+# Ask the customer which one to use
+
+# Run code, reuse context_id for follow-up MCP call
+result = execute_code(code="...", compute_type="cluster", cluster_id="...")
+execute_code(code="...", context_id=result["context_id"], cluster_id=result["cluster_id"])
 ```
 
-### Create a SQL warehouse
+## MCP Tools
 
-```python
-manage_sql_warehouse(action="create", name="analytics-wh", size="Medium")
-```
+| Tool | For | Purpose |
+|------|-----|---------|
+| `execute_code` | Serverless, Interactive | Run code remotely |
+| `list_compute` | Interactive | List clusters, check status, auto-select running cluster |
+| `manage_cluster` | Interactive | Create, start, terminate, delete. **COSTLY:** `start` takes 3-8 min—ask user |
+| `manage_sql_warehouse` | SQL | Create, modify, delete SQL warehouses |
 
 ## Related Skills
 
-- **[databricks-jobs](../databricks-jobs/SKILL.md)** — Production job orchestration with scheduling, retries, and multi-task DAGs
-- **[databricks-dbsql](../databricks-dbsql/SKILL.md)** — SQL warehouse capabilities and AI functions
-- **[databricks-python-sdk](../databricks-python-sdk/SKILL.md)** — Direct SDK usage for workspace automation
+- **[databricks-synthetic-data-gen](../databricks-synthetic-data-gen/SKILL.md)** — Data generation using Spark + Faker
+- **[databricks-jobs](../databricks-jobs/SKILL.md)** — Production job orchestration
+- **[databricks-dbsql](../databricks-dbsql/SKILL.md)** — SQL warehouse and AI functions
Lines changed: 72 additions & 0 deletions (new file: references/1-databricks-connect.md)
@@ -0,0 +1,72 @@
# Databricks Connect (Recommended Default)

**Use when:** Running Spark code locally that executes on Databricks serverless compute. This is the fastest, cleanest approach for data generation, ETL, and any Spark workload.

## Why Databricks Connect First?

- **Instant iteration** — Edit the file, re-run immediately
- **Local debugging** — IDE debugger and breakpoints work
- **No cold start** — Session stays warm across executions
- **Clean dependencies** — `withDependencies()` installs packages on remote compute

## Requirements

- **Python 3.12** (databricks-connect >= 16.4 requires it)
- **databricks-connect >= 16.4** package
- **~/.databrickscfg** with serverless config

## Setup

**Python 3.12 required.** If it's not available, install it (with uv or another tool). If the install fails, ask the user—don't auto-switch modes.
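The version requirement above can be checked up front. A minimal sketch (the warning text is mine):

```python
import sys

# databricks-connect >= 16.4 requires Python 3.12 (see Requirements above)
if sys.version_info[:2] != (3, 12):
    print(
        f"Python {sys.version_info.major}.{sys.version_info.minor} detected; "
        "databricks-connect >= 16.4 requires 3.12. Create a dedicated venv."
    )
```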
Use the default profile; if it isn't set up, you can add one to `~/.databrickscfg` (never overwrite an existing config without consent):

```ini
[DEFAULT]
host = https://your-workspace.cloud.databricks.com/
serverless_compute_id = auto
auth_type = databricks-cli
```
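Since `.databrickscfg` is plain INI, the serverless settings above can be sanity-checked with the standard library before connecting. A sketch using an inline sample in place of the real file:

```python
import configparser

# Sketch: verify a .databrickscfg has the serverless settings shown above.
# The sample string stands in for the contents of ~/.databrickscfg.
sample = """\
[DEFAULT]
host = https://your-workspace.cloud.databricks.com/
serverless_compute_id = auto
auth_type = databricks-cli
"""
cfg = configparser.ConfigParser()
cfg.read_string(sample)
serverless_ready = cfg["DEFAULT"].get("serverless_compute_id") == "auto"
```

To check the real file, replace `read_string(sample)` with `cfg.read(os.path.expanduser("~/.databrickscfg"))`.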
## Usage Pattern

```python
from databricks.connect import DatabricksSession, DatabricksEnv

# Declare dependencies installed on serverless compute
# CRITICAL: Include ALL packages used inside UDFs (pandas/numpy are there by default)
env = DatabricksEnv().withDependencies("faker", "holidays")

spark = (
    DatabricksSession.builder
    .profile("my-workspace")  # optional: use a specific profile from ~/.databrickscfg instead of the default
    .withEnvironment(env)
    .serverless(True)
    .getOrCreate()
)

# Spark code now executes on Databricks serverless
df = spark.range(1000)...
df.write.mode('overwrite').saveAsTable("catalog.schema.table")
```
## Common Issues

| Issue | Solution |
|-------|----------|
| `Python 3.12 required` | Create a venv with Python 3.12 |
| `DatabricksEnv not found` | Upgrade to databricks-connect >= 16.4 |
| `serverless_compute_id` error | Add `serverless_compute_id = auto` to ~/.databrickscfg |
| `ModuleNotFoundError` inside UDF | Add the package to `withDependencies()` |
| `PERSIST TABLE not supported` | Don't use `.cache()` or `.persist()` with serverless |
| `broadcast` is used | Don't broadcast small DataFrames with Spark Connect; use a small Python list or a plain join with the small DataFrame instead |

## When NOT to Use

Switch to **[Serverless Job](2-serverless-job.md)** when:
- One-off execution
- Heavy ML training that shouldn't depend on the local machine staying connected
- Non-Spark Python code (pure sklearn, pytorch, etc.)

Switch to **[Interactive Cluster](3-interactive-cluster.md)** when:
- You need state across multiple separate MCP tool calls
- You need Scala or R support
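The switching rules above can be condensed into a tiny chooser. This is a hypothetical helper (names and parameters are mine) that encodes the decision order, with Databricks Connect as the default:

```python
def choose_mode(needs_state=False, needs_scala_or_r=False,
                heavy_ml=False, one_off=False):
    # Interactive cluster wins when state or Scala/R support is needed;
    # serverless job for heavy ML or one-off runs; otherwise default.
    if needs_state or needs_scala_or_r:
        return "interactive-cluster"
    if heavy_ml or one_off:
        return "serverless-job"
    return "databricks-connect"
```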
Lines changed: 76 additions & 0 deletions (new file: references/2-serverless-job.md)
@@ -0,0 +1,76 @@
# Serverless Job Execution

**Use when:** Running intensive Python code remotely (ML training, heavy processing) that doesn't need Spark, or when the code shouldn't depend on the local machine staying connected.

## When to Choose Serverless Job

- ML model training (runs independently of the local machine)
- Heavy non-Spark Python processing
- Code that takes > 5 minutes (the local connection can drop)
- Production/scheduled runs

## Trade-offs

| Pro | Con |
|-----|-----|
| No cluster to manage | ~25-50s cold start each invocation |
| Up to 30 min timeout | No state preserved between calls |
| Independent execution | print() unreliable—use `dbutils.notebook.exit()` |

## Executing Code

### Prefer running from a local file (edit the file, then run it)

```python
execute_code(
    file_path="/local/path/to/train_model.py",
    compute_type="serverless"
)
```

## Jobs with Custom Dependencies

Use `job_extra_params` to install pip packages:

```python
execute_code(
    file_path="/path/to/train.py",
    job_extra_params={
        "environments": [{
            "environment_key": "ml_env",
            "spec": {"client": "4", "dependencies": ["scikit-learn", "pandas", "mlflow"]}
        }]
    }
)
```

**CRITICAL:** Use `"client": "4"` in the spec. `"client": "1"` won't install dependencies.
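Since the environments spec is easy to get subtly wrong, it can help to build it with a small helper that bakes in `"client": "4"`. A sketch (the helper name and `key` default are mine, mirroring the spec shown above):

```python
def ml_environment(*packages, key="ml_env"):
    """Build job_extra_params for serverless dependency installation.

    Hypothetical convenience helper; "client": "4" is hard-coded because
    "1" silently skips dependency installation (see the note above).
    """
    return {
        "environments": [{
            "environment_key": key,
            "spec": {"client": "4", "dependencies": list(packages)},
        }]
    }

params = ml_environment("scikit-learn", "pandas", "mlflow")
```

The result would be passed as `job_extra_params=params` in the `execute_code` call.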
## Output Handling

```python
# ❌ BAD - print() may not be captured
print("Training complete!")

# ✅ GOOD - Use dbutils.notebook.exit()
import json
results = {"accuracy": 0.95, "model_path": "/Volumes/..."}
dbutils.notebook.exit(json.dumps(results))
```
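The exit payload travels back to the caller as a single string, which is why the pattern above serializes to JSON rather than printing fields. A minimal caller-side sketch of the round trip (values are illustrative):

```python
import json

# What the job sends via dbutils.notebook.exit(...)
results = {"accuracy": 0.95, "model_path": "/Volumes/..."}
payload = json.dumps(results)

# Caller side: recover structured results from the exit string
parsed = json.loads(payload)
```

Keeping the payload to JSON-serializable types (no numpy scalars, no datetimes) avoids `json.dumps` failures at the very end of a long run.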
## Common Issues

| Issue | Solution |
|-------|----------|
| print() output missing | Use `dbutils.notebook.exit()` |
| `ModuleNotFoundError` | Add the package to the environments spec with `"client": "4"` |
| Job times out | Max is 1800s (30 min); split into smaller tasks |

## When NOT to Use

Switch to **[Databricks Connect](1-databricks-connect.md)** when:
- Iterating on Spark code and you want instant feedback
- You need local debugging with breakpoints

Switch to **[Interactive Cluster](3-interactive-cluster.md)** when:
- You need state across multiple MCP tool calls
- You need Scala or R support
