feat: sift parquet integration — interactive DataFrame rendering #1453
Description
Problem
DataFrames displayed in nteract notebooks currently render as text/html tables (pandas) or text/plain (fallback). We want them to render as interactive, filterable, sortable tables using sift (@nteract/sift) — a fast dataframe viewer built on pretext + WASM (100k+ rows at 120fps).
Architecture
Data flow: Kernel → RuntimeAgent → Daemon → Frontend
```
Kernel (Python)              RuntimeAgent              Daemon                Frontend
 |                            |                         |                     |
 | df.to_parquet(buf)         |                         |                     |
 | blob_upload(parquet) ----->| blob_store.put()        |                     |
 |                            |   → hash                |                     |
 | display_data on IOPub:     |                         |                     |
 |   {blob_hash, text/plain}  |                         |                     |
 |                            |                         |                     |
 | [agent IOPub task picks up]|                         |                     |
 |                            | create_manifest()       |                     |
 |                            | store in blob store     |                     |
 |                            | write hash to CRDT      |                     |
 |                            |--- RuntimeStateSync --->|                     |
 |                            |                         |--- sync frame ----->|
 |                            |                         |                     |
 |                            |                         |    resolve manifest |
 |                            |                         |    GET /blob/{hash} |
 |                            |                         |    SiftTable render |
```
Key insight: Parquet bytes bypass IOPub entirely. The kernel writes directly to the blob store (same filesystem as agent), then emits a lightweight JSON reference on IOPub. The agent's normal manifest pipeline picks up the reference. The frontend resolves the blob URL and sift's WASM decodes the parquet client-side.
Why out-of-band (not IOPub)
- IOPub base64-encodes binary → 33% inflation (100MB parquet → 133MB on wire → decoded back)
- The agent's blob store is the same filesystem directory — direct writes have no protocol overhead
- Display data on IOPub is just `{"blob_hash": "..."}` — a few bytes
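The 33% figure follows directly from base64's encoding ratio: 4 output characters for every 3 input bytes. A quick check:

```python
import base64

raw = b"\x00" * 3_000_000      # 3 MB of binary, stand-in for parquet bytes
wire = base64.b64encode(raw)   # what IOPub would carry if we inlined the blob

# base64 emits 4 ASCII bytes per 3 input bytes: exactly 4/3 inflation
assert len(wire) == 4_000_000
```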
MIME type
`application/vnd.nteract.dataframe+parquet`
The data field is a JSON string with the blob hash. All schema metadata (columns, types, row count) lives inside the parquet file itself — parquet is self-describing. No separate metadata needed.
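Concretely, the published mimebundle stays tiny no matter how large the DataFrame is. A sketch (the hash value and `text/plain` repr are illustrative, not real output):

```python
import json

MIME = "application/vnd.nteract.dataframe+parquet"

# Sketch of what the kernel publishes on IOPub: a JSON reference
# plus a plain-text fallback. The hash here is illustrative.
bundle = {
    MIME: json.dumps({"blob_hash": "9f86d081884c7d65"}),
    "text/plain": "<DataFrame: 100000 rows x 3 columns>",  # fallback for other frontends
}

# The reference is a few dozen bytes regardless of DataFrame size
assert len(bundle[MIME]) < 100
```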
Components
1. Repo structure: monorepo package at packages/sift/
Add sift as a pnpm workspace package. pnpm-workspace.yaml already has packages/*. The nteract-predicate WASM crate joins the Cargo workspace. Sift's standalone dev workflow (cd packages/sift && pnpm dev) is preserved for fast iteration.
2. Frontend: DataFrameOutput component
- Register custom MIME type in `MediaProvider` (same pattern as widget-view in `App.tsx`)
- Add to `MAIN_DOM_SAFE_TYPES` (sift is pure DOM, no script execution risk)
- Add to `DEFAULT_PRIORITY` above `text/html`
- `DataFrameOutput` resolves blob URL via blob port, renders `<SiftTable url={blobUrl} />`
- Sift's WASM decodes parquet → virtual scrolled table with filter/sort/crossfilter
3. Python: IPython display formatter + blob upload
Follow the pattern pandas uses for application/vnd.dataresource+json (pandas/io/formats/printing.py:302):
- Register a custom IPython formatter for our MIME type
- Handles both `pandas.DataFrame` and `polars.DataFrame` (by type, not method)
- On display: `df.to_parquet(buf)` → write to blob store → return `{blob_hash}` + `text/plain` fallback
- Non-invasive: adds alongside existing `text/html`; other frontends fall back gracefully
Research findings:
- pandas: has `_repr_html_()`, no `_repr_mimebundle_()`. Has precedent for custom MIME formatters via IPython's `display_formatter.formatters`
- polars: has `_repr_html_()` only, no `_repr_mimebundle_()`. Our formatter registers by type, so it works for both
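Following the pandas dataresource precedent, the registration could look roughly like the sketch below. `write_blob()` is a hypothetical stand-in for the direct blob-store write this PR adds, and the `for_type_by_name` module paths assume current pandas/polars layouts:

```python
import hashlib
import io
import json

MIME = "application/vnd.nteract.dataframe+parquet"


def write_blob(parquet_bytes: bytes) -> str:
    """Hypothetical placeholder: persist bytes to the shared blob store
    and return the content hash. The real transport is part of this PR."""
    return hashlib.sha256(parquet_bytes).hexdigest()


def _pandas_to_ref(df) -> str:
    buf = io.BytesIO()
    df.to_parquet(buf)                      # pandas parquet API
    return json.dumps({"blob_hash": write_blob(buf.getvalue())})


def _polars_to_ref(df) -> str:
    buf = io.BytesIO()
    df.write_parquet(buf)                   # polars uses write_parquet, not to_parquet
    return json.dumps({"blob_hash": write_blob(buf.getvalue())})


def register() -> None:
    """Register by concrete type, so it works even though polars
    has no _repr_mimebundle_()."""
    from IPython import get_ipython
    from IPython.core.formatters import BaseFormatter

    ip = get_ipython()
    if ip is None:
        return
    formatters = ip.display_formatter.formatters
    if MIME not in formatters:
        class ParquetRefFormatter(BaseFormatter):
            _return_type = (str,)
        formatters[MIME] = ParquetRefFormatter()
    fmt = formatters[MIME]
    # for_type_by_name avoids importing pandas/polars eagerly
    fmt.for_type_by_name("pandas.core.frame", "DataFrame", _pandas_to_ref)
    fmt.for_type_by_name("polars.dataframe.frame", "DataFrame", _polars_to_ref)
    fmt.enabled = True
```

This mirrors how pandas enables its `application/vnd.dataresource+json` formatter: add a `BaseFormatter` keyed by the MIME type, then bind concrete types to a callback.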
4. Output widget / iframe handling
DataFrames inside `ipywidgets.Output` render in iframes. Start with a hybrid: sift in the main DOM, falling back to `text/html` in iframe contexts (the `isInIframe()` check already exists in `MediaRouter`). Revisit iframe sift rendering later if needed.
5. Daemon: no changes needed
Blob store is content-addressed and media-type-agnostic. The custom MIME type's data is JSON text → normal manifest pipeline.
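Content addressing is exactly why the daemon needs no changes: the store keys on the hash of the bytes, never on their format. A minimal sketch of the idea (sha256 and the `blob_put` name are assumptions for illustration, not the daemon's actual API):

```python
import hashlib
from pathlib import Path


def blob_put(blob_dir: Path, data: bytes) -> str:
    """Content-addressed write: the key is the hash of the bytes,
    so identical payloads dedupe and any media type works unchanged."""
    digest = hashlib.sha256(data).hexdigest()
    path = blob_dir / digest
    if not path.exists():   # idempotent: same bytes always map to the same key
        path.write_bytes(data)
    return digest
```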
Phasing
Phase 1: Sift in monorepo + frontend renderer
- Copy sift source into `packages/sift/`
- Add `nteract-predicate` to Cargo workspace, wire up WASM build
- Create `DataFrameOutput` component + MIME registration
- Manual test with parquet in blob store
- Verify sift standalone dev still works
Phase 2: Python formatter + blob upload (end-to-end)
- `blob.upload()` in runtimed (direct filesystem write)
- IPython display formatter for pandas + polars
- Auto-registration on kernel start
- Test: `pd.DataFrame({"x": range(100_000)})` → sift table
Phase 3: Polish
- Output widget iframe fallback
- Truncation UX for large DataFrames
- Theme integration (dark/light mode)
- `.ipynb` save with `text/plain` + `text/html` fallbacks
- Streaming by row group for very large DataFrames
Open questions
- WASM in Tauri bundle — `nteract-predicate.wasm` needs to be in the app assets. Copy pipeline TBD.
- Formatter auto-registration — agent-injected startup code vs IPython extension vs kernel spec hook?
- Max DataFrame size — UX for exceeding the 100MB blob limit? Truncate? Warn?
- Remote kernels — blob upload API is abstract (`blob_store.upload()`) so transport can change for SSH agents (feat(runtimed): SSH remote runtimes #1334). Not solving now.
Related
- feat(runtimed): SSH remote runtimes #1334 (SSH runtime agents — informs blob upload API design)