
feat: sift parquet integration — interactive DataFrame rendering #1453

@rgbkrk

Description

Problem

DataFrames displayed in nteract notebooks currently render as static text/html tables (pandas) or text/plain (fallback). We want them to render as interactive, filterable, sortable tables using sift (@nteract/sift) — a fast DataFrame viewer built on pretext + WASM that handles 100k+ rows at 120fps.

Architecture

Data flow: Kernel → RuntimeAgent → Daemon → Frontend

Kernel (Python)                  RuntimeAgent              Daemon            Frontend
  |                                  |                       |                  |
  | df.to_parquet(buf)               |                       |                  |
  | blob_upload(parquet) ----------->| blob_store.put()      |                  |
  |                                  | → hash                |                  |
  | display_data on IOPub:           |                       |                  |
  |   {blob_hash, text/plain}        |                       |                  |
  |                                  |                       |                  |
  |   [agent IOPub task picks up] -->|                       |                  |
  |                                  | create_manifest()     |                  |
  |                                  | store in blob store   |                  |
  |                                  | write hash to CRDT    |                  |
  |                                  |--- RuntimeStateSync ->|                  |
  |                                  |                       |--- sync frame -->|
  |                                  |                       |                  |
  |                                  |                       |  resolve manifest|
  |                                  |                       |  GET /blob/{hash}|
  |                                  |                       |  SiftTable render|

Key insight: the parquet bytes bypass IOPub entirely. The kernel writes them directly to the blob store (the same filesystem the agent uses), then emits a lightweight JSON reference on IOPub. The agent's normal manifest pipeline picks up the reference, the frontend resolves the blob URL, and sift's WASM decodes the parquet client-side.
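The kernel-side half of that flow can be sketched as follows. Hedged: `BLOB_DIR`, `publish_parquet`, and the environment variable are illustrative names for this sketch, not the real runtimed API, and the blob store's actual layout and hash algorithm may differ.

```python
import hashlib
import json
import os

# Illustrative sketch of the kernel-side write path described above.
# BLOB_DIR and publish_parquet are assumed names, not the real runtimed API.
BLOB_DIR = os.environ.get("NTERACT_BLOB_DIR", "/tmp/nteract-blobs")
MIME = "application/vnd.nteract.dataframe+parquet"

def publish_parquet(parquet_bytes: bytes, text_fallback: str) -> dict:
    """Write parquet bytes straight into the blob store (shared filesystem)
    and return the lightweight mimebundle that goes out on IOPub."""
    os.makedirs(BLOB_DIR, exist_ok=True)
    # Content-addressed write: the hash of the bytes names the blob.
    blob_hash = hashlib.sha256(parquet_bytes).hexdigest()
    with open(os.path.join(BLOB_DIR, blob_hash), "wb") as f:
        f.write(parquet_bytes)
    # Only this small JSON reference travels over IOPub, never the bytes.
    return {
        MIME: json.dumps({"blob_hash": blob_hash}),
        "text/plain": text_fallback,
    }
```

In the kernel, the parquet bytes would come from `df.to_parquet(buf)` and the returned bundle would be handed to IPython's display machinery via `display(bundle, raw=True)`.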

Why out-of-band (not IOPub)

  • IOPub base64-encodes binary → 33% inflation (100MB parquet → 133MB on wire → decoded back)
  • The agent's blob store is the same filesystem directory — direct writes have no protocol overhead
  • Display data on IOPub is just {"blob_hash": "..."} — a few bytes
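The inflation figure is easy to check: base64 emits 4 ASCII characters for every 3 input bytes, a 4/3 ≈ 1.33× blow-up (plus padding on the tail).

```python
import base64
import os

# base64 maps every 3 input bytes to 4 ASCII characters: 4/3 ≈ 1.33x.
raw = os.urandom(3 * 100_000)          # 300,000 bytes, divisible by 3
encoded = base64.b64encode(raw)
print(len(encoded) / len(raw))         # 1.3333... — the "33% inflation"
```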

MIME type

application/vnd.nteract.dataframe+parquet

The data field is a JSON string with the blob hash. All schema metadata (columns, types, row count) lives inside the parquet file itself — parquet is self-describing. No separate metadata needed.


Components

1. Repo structure: monorepo package at packages/sift/

Add sift as a pnpm workspace package. pnpm-workspace.yaml already has packages/*. The nteract-predicate WASM crate joins the Cargo workspace. Sift's standalone dev workflow (cd packages/sift && pnpm dev) is preserved for fast iteration.

2. Frontend: DataFrameOutput component

  • Register custom MIME type in MediaProvider (same pattern as widget-view in App.tsx)
  • Add to MAIN_DOM_SAFE_TYPES (sift is pure DOM, no script execution risk)
  • Add to DEFAULT_PRIORITY above text/html
  • DataFrameOutput resolves blob URL via blob port, renders <SiftTable url={blobUrl} />
  • Sift's WASM decodes parquet → virtual scrolled table with filter/sort/crossfilter

3. Python: IPython display formatter + blob upload

Follow the pattern pandas uses for application/vnd.dataresource+json (pandas/io/formats/printing.py:302):

  • Register a custom IPython formatter for our MIME type
  • Handles both pandas.DataFrame and polars.DataFrame (by type, not method)
  • On display: df.to_parquet(buf) → write to blob store → return {blob_hash} + text/plain fallback
  • Non-invasive: adds alongside existing text/html, other frontends fall back gracefully

Research findings:

  • pandas: has _repr_html_(), no _repr_mimebundle_(). Has precedent for custom MIME formatters via IPython's display_formatter.formatters
  • polars: has _repr_html_() only, no _repr_mimebundle_(). Our formatter registers by type so it works

4. Output widget / iframe handling

DataFrames inside ipywidgets.Output render in iframes. Start with a hybrid: sift in main DOM, fall back to text/html in iframe contexts (the isInIframe() check already exists in MediaRouter). Revisit iframe sift rendering later if needed.

5. Daemon: no changes needed

Blob store is content-addressed and media-type-agnostic. The custom MIME type's data is JSON text → normal manifest pipeline.
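The "content-addressed" property is why no daemon changes are needed: the blob's name is derived from its bytes, so the daemon handles any media type identically. A minimal illustration, assuming SHA-256 (the actual hash used by the blob store isn't specified here, and `verify_blob` is an illustrative name):

```python
import hashlib

def verify_blob(blob_hash: str, data: bytes) -> bool:
    # Content addressing: the hash in the reference must equal the hash of
    # the bytes, so the store never needs to understand the media type.
    return hashlib.sha256(data).hexdigest() == blob_hash
```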


Phasing

Phase 1: Sift in monorepo + frontend renderer

  • Copy sift source into packages/sift/
  • Add nteract-predicate to Cargo workspace, wire up WASM build
  • Create DataFrameOutput component + MIME registration
  • Manual test with parquet in blob store
  • Verify sift standalone dev still works

Phase 2: Python formatter + blob upload (end-to-end)

  • blob.upload() in runtimed (direct filesystem write)
  • IPython display formatter for pandas + polars
  • Auto-registration on kernel start
  • Test: pd.DataFrame({"x": range(100_000)}) → sift table

Phase 3: Polish

  • Output widget iframe fallback
  • Truncation UX for large DataFrames
  • Theme integration (dark/light mode)
  • .ipynb save with text/plain + text/html fallbacks
  • Streaming by row group for very large DataFrames

Open questions

  1. WASM in Tauri bundle — nteract-predicate.wasm needs to be in the app assets. Copy pipeline TBD.
  2. Formatter auto-registration — agent-injected startup code vs IPython extension vs kernel spec hook?
  3. Max DataFrame size — UX for exceeding 100MB blob limit? Truncate? Warn?
  4. Remote kernels — blob upload API is abstract (blob_store.upload()) so transport can change for SSH agents (feat(runtimed): SSH remote runtimes #1334). Not solving now.
