feat: sift parquet integration — interactive DataFrame rendering #1453
Description
Problem
DataFrames displayed in nteract notebooks currently render as text/html tables (pandas) or text/plain (fallback). We want them to render as interactive, filterable, sortable tables using sift (@nteract/sift) — a fast dataframe viewer built on pretext + WASM (100k+ rows at 120fps).
Architecture
Data flow: Kernel → RuntimeAgent → Daemon → Frontend
```
Kernel (Python)              RuntimeAgent              Daemon                Frontend
 |                            |                         |                     |
 | df.to_parquet(buf)         |                         |                     |
 | blob_upload(parquet) ----->| blob_store.put()        |                     |
 |                            |   → hash                |                     |
 | display_data on IOPub:     |                         |                     |
 |   {blob_hash, text/plain}  |                         |                     |
 |                            |                         |                     |
 | [agent IOPub task picks up]|                         |                     |
 |                            | create_manifest()       |                     |
 |                            | store in blob store     |                     |
 |                            | write hash to CRDT      |                     |
 |                            |--- RuntimeStateSync --->|                     |
 |                            |                         |--- sync frame ----->|
 |                            |                         |                     |
 |                            |                         |    resolve manifest |
 |                            |                         |    GET /blob/{hash} |
 |                            |                         |    SiftTable render |
```
Key insight: Parquet bytes bypass IOPub entirely. The kernel writes directly to the blob store (same filesystem as agent), then emits a lightweight JSON reference on IOPub. The agent's normal manifest pipeline picks up the reference. The frontend resolves the blob URL and sift's WASM decodes the parquet client-side.
Why out-of-band (not IOPub)
- IOPub base64-encodes binary → 33% inflation (100MB parquet → 133MB on wire → decoded back)
- The agent's blob store is the same filesystem directory — direct writes have no protocol overhead
- Display data on IOPub is just `{"blob_hash": "..."}` — a few bytes
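The 33% figure follows directly from base64's encoding ratio: 4 output characters for every 3 input bytes. A quick check:

```python
import base64

raw = b"\x00" * 3_000_000      # 3 MB of binary, stand-in for parquet bytes
wire = base64.b64encode(raw)   # what IOPub would carry if we inlined the blob

# base64 emits 4 ASCII bytes per 3 input bytes: exactly 4/3 inflation
assert len(wire) == 4_000_000
```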
MIME type
`application/vnd.nteract.dataframe+parquet`
The data field is a JSON string with the blob hash. All schema metadata (columns, types, row count) lives inside the parquet file itself — parquet is self-describing. No separate metadata needed.
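Concretely, the published mimebundle stays tiny no matter how large the DataFrame is. A sketch (the hash value and `text/plain` repr are illustrative, not real output):

```python
import json

MIME = "application/vnd.nteract.dataframe+parquet"

# Sketch of what the kernel publishes on IOPub: a JSON reference
# plus a plain-text fallback. The hash here is illustrative.
bundle = {
    MIME: json.dumps({"blob_hash": "9f86d081884c7d65"}),
    "text/plain": "<DataFrame: 100000 rows x 3 columns>",  # fallback for other frontends
}

# The reference is a few dozen bytes regardless of DataFrame size
assert len(bundle[MIME]) < 100
```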
Components
1. Repo structure: monorepo package at packages/sift/
Add sift as a pnpm workspace package. pnpm-workspace.yaml already has packages/*. The nteract-predicate WASM crate joins the Cargo workspace. Sift's standalone dev workflow (cd packages/sift && pnpm dev) is preserved for fast iteration.
2. Frontend: DataFrameOutput component
- Register custom MIME type in `MediaProvider` (same pattern as widget-view in `App.tsx`)
- Add to `MAIN_DOM_SAFE_TYPES` (sift is pure DOM, no script execution risk)
- Add to `DEFAULT_PRIORITY` above `text/html`
- `DataFrameOutput` resolves blob URL via blob port, renders `<SiftTable url={blobUrl} />`
- Sift's WASM decodes parquet → virtual scrolled table with filter/sort/crossfilter
3. Python: IPython display formatter + blob upload
Follow the pattern pandas uses for application/vnd.dataresource+json (pandas/io/formats/printing.py:302):
- Register a custom IPython formatter for our MIME type
- Handles both `pandas.DataFrame` and `polars.DataFrame` (by type, not method)
- On display: `df.to_parquet(buf)` → write to blob store → return `{blob_hash}` + `text/plain` fallback
- Non-invasive: adds alongside existing `text/html`; other frontends fall back gracefully
Research findings:
- pandas: has `_repr_html_()`, no `_repr_mimebundle_()`. Has precedent for custom MIME formatters via IPython's `display_formatter.formatters`
- polars: has `_repr_html_()` only, no `_repr_mimebundle_()`. Our formatter registers by type, so it works for both
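Following the pandas dataresource precedent, the registration could look roughly like the sketch below. `write_blob()` is a hypothetical stand-in for the direct blob-store write this PR adds, and the `for_type_by_name` module paths assume current pandas/polars layouts:

```python
import hashlib
import io
import json

MIME = "application/vnd.nteract.dataframe+parquet"


def write_blob(parquet_bytes: bytes) -> str:
    """Hypothetical placeholder: persist bytes to the shared blob store
    and return the content hash. The real transport is part of this PR."""
    return hashlib.sha256(parquet_bytes).hexdigest()


def _pandas_to_ref(df) -> str:
    buf = io.BytesIO()
    df.to_parquet(buf)                      # pandas parquet API
    return json.dumps({"blob_hash": write_blob(buf.getvalue())})


def _polars_to_ref(df) -> str:
    buf = io.BytesIO()
    df.write_parquet(buf)                   # polars uses write_parquet, not to_parquet
    return json.dumps({"blob_hash": write_blob(buf.getvalue())})


def register() -> None:
    """Register by concrete type, so it works even though polars
    has no _repr_mimebundle_()."""
    from IPython import get_ipython
    from IPython.core.formatters import BaseFormatter

    ip = get_ipython()
    if ip is None:
        return
    formatters = ip.display_formatter.formatters
    if MIME not in formatters:
        class ParquetRefFormatter(BaseFormatter):
            _return_type = (str,)
        formatters[MIME] = ParquetRefFormatter()
    fmt = formatters[MIME]
    # for_type_by_name avoids importing pandas/polars eagerly
    fmt.for_type_by_name("pandas.core.frame", "DataFrame", _pandas_to_ref)
    fmt.for_type_by_name("polars.dataframe.frame", "DataFrame", _polars_to_ref)
    fmt.enabled = True
```

This mirrors how pandas enables its `application/vnd.dataresource+json` formatter: add a `BaseFormatter` keyed by the MIME type, then bind concrete types to a callback.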
4. Output widget / iframe handling
DataFrames inside `ipywidgets.Output` render in iframes. Start with a hybrid: sift in the main DOM, falling back to `text/html` in iframe contexts (the `isInIframe()` check already exists in `MediaRouter`). Revisit iframe sift rendering later if needed.
5. Daemon: no changes needed
Blob store is content-addressed and media-type-agnostic. The custom MIME type's data is JSON text → normal manifest pipeline.
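Content addressing is exactly why the daemon needs no changes: the store keys on the hash of the bytes, never on their format. A minimal sketch of the idea (sha256 and the `blob_put` name are assumptions for illustration, not the daemon's actual API):

```python
import hashlib
from pathlib import Path


def blob_put(blob_dir: Path, data: bytes) -> str:
    """Content-addressed write: the key is the hash of the bytes,
    so identical payloads dedupe and any media type works unchanged."""
    digest = hashlib.sha256(data).hexdigest()
    path = blob_dir / digest
    if not path.exists():   # idempotent: same bytes always map to the same key
        path.write_bytes(data)
    return digest
```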
Phasing
Phase 1: Sift in monorepo + frontend renderer
- Copy sift source into `packages/sift/`
- Add `nteract-predicate` to Cargo workspace, wire up WASM build
- Create `DataFrameOutput` component + MIME registration
- Manual test with parquet in blob store
- Verify sift standalone dev still works
Phase 2: Python formatter + blob upload (end-to-end)
- `blob.upload()` in runtimed (direct filesystem write)
- IPython display formatter for pandas + polars
- Auto-registration on kernel start
- Test: `pd.DataFrame({"x": range(100_000)})` → sift table
Phase 3: Polish
- Output widget iframe fallback
- Truncation UX for large DataFrames
- Theme integration (dark/light mode)
- `.ipynb` save with `text/plain` + `text/html` fallbacks
- Streaming by row group for very large DataFrames
Open questions
- WASM in Tauri bundle — `nteract-predicate.wasm` needs to be in the app assets. Copy pipeline TBD.
- Formatter auto-registration — agent-injected startup code vs IPython extension vs kernel spec hook?
- Max DataFrame size — UX for exceeding the 100MB blob limit? Truncate? Warn?
- Remote kernels — blob upload API is abstract (`blob_store.upload()`) so transport can change for SSH agents (feat(runtimed): SSH remote runtimes #1334). Not solving now.
Related
- feat(runtimed): SSH remote runtimes #1334 (SSH runtime agents — informs blob upload API design)