feat: sql command — DuckDB queries over stored documents (CLI, API, MCP, UI) #18
…CP, UI)

Adds a sandboxed `sql` op that runs DuckDB SQL over drive documents: csv, tsv, parquet, xlsx, json, ndjson/jsonl (text formats also .gz), sqlite (.db/.sqlite/.sqlite3, tables exposed as <name>.<table>), and .duckdb. Documents are referenced by quoted drive-path literals (SELECT * FROM '/data/sales.csv'), bound as named tables, or format-overridden to make any document queryable (-t logs=/raw/data.txt:csv).

Security: the sqlite extension bypasses DuckDB's allowed_directories sandbox (own file I/O), so all inputs are materialized into in-memory tables during setup (sqlite via an isolated bridge instance converted to parquet). Then the local filesystem is disabled, external access cut, and configuration locked before any user SQL runs. Interrupt-based timeout, per-file size cap, and memory limit are env-tunable.

- core: sql op + registry/RBAC (viewer) + data-file mime types
- cli: `agent-fs sql` with repeatable -t bindings, stdin, --max-rows, aligned table output; @duckdb/node-api marked external in the bundle
- server/mcp: exposed automatically via ops dispatch + tool registry
- live: SQL workbench at /sql/~/:orgId/:driveId, with DuckDB-WASM in the browser for csv/tsv/parquet/json and automatic server fallback for xlsx/sqlite/duckdb; results table, SVG charts, CSV/JSON export, "Query" button on queryable files
- e2e: 15 new tests (CLI + raw API + MCP) against MinIO; suite 96/96
- docs: docs/sql.md + landing registration + SKILL.md; versions synced to 0.8.2
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: f564274f59
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
…SQL UX

Adds a shared DataGrid (TanStack Table + react-virtual) used by both the SQL results panel and a new default preview for tabular/data documents, plus a batch of SQL workbench UX fixes from QA.

DataGrid & previews:
- live/src/components/data-grid/DataGrid.tsx: headless, virtualized, themed grid (sort, resize, type tooltips, dimmed NULLs); replaces the hand-rolled results table.
- TablePreviewViewer: csv/tsv/ndjson/parquet/xlsx render as a grid by default via the SQL engine (server-side, so previews work even when the raw-bytes presigned fetch is CORS-blocked); csv/tsv/ndjson keep a Source toggle.
- DatabasePreviewViewer: sqlite/duckdb files list their tables and preview the selected one in the grid.
- FileViewer: route tabular/db files to these previews; office binaries (doc/docx/ppt/pptx/xls/odt…) no longer render as garbled text and instead fall back to the download view.

SQL workbench:
- Engine control is now a Browser/Server toggle with tooltips, reflecting the effective engine; Browser disabled (with reason) when unsupported.
- Browser-load size estimate; sizes on doc chips and preview footers.
- sqlite/duckdb seeding introspects tables → seeds a working `schema.table` query (fixes the path-literal error on database files).
- Cmd/Ctrl+Enter runs via Monaco addAction (reliable keybinding).
- Editor autocompletes bound table and `schema.table` names.

Misc:
- Switching connection/org/drive resets to `/`.
- Correct the MinIO-gated op-registry count assertion (now 29, includes sql).
Update: shared DataGrid + tabular/database file previews + SQL UX (commit b15f9c4)
Adds the web UI layer on top of the SQL op.
All verified in a real headless-Chrome pass (images, video, markdown, pdf, csv grid, sqlite table preview, docx fallback). Changes are live/-only except a one-line fix to a MinIO-gated op-registry count assertion (now 29, includes `sql`).
💡 Codex Review
Reviewed commit: b15f9c454e
- sitemap.xml + sitemap.md: add /docs/sql (IndexNow submits it on deploy, since it parses sitemap.xml).
- DocsPage: per-document SEO meta (unique <title>, description, canonical, Open Graph, Twitter, and TechArticle JSON-LD) so every doc page (including sql) is independently indexable and answer-engine citable.
- Re-sync public/docs/openapi.json to the 0.9.0 spec.

llms.txt already lists the SQL page (Reference + core commands); public/docs/sql.md is synced from docs/sql.md by generate:md.
💡 Codex Review
Reviewed commit: d4dc3430d3
…, cache, engine routing

Resolves the 6 P2 review findings:

core (packages/core/src/ops/sql.ts):
- Only rewrite path literals in FROM/JOIN table position, never value-position strings: `WHERE x = '/data/sales.csv'` is no longer clobbered.
- Honor a real .gz suffix even when the format is explicitly overridden, so the temp file keeps its .gz extension and DuckDB decompresses it.
- Uniquify duplicate result column names (e.g. two projected `id`s) so building row objects never drops data by key collision.

live (sql-engine):
- formatForPath strips an optional .gz suffix (and deriveTableName too); the document picker adds .gz globs, so gzipped text docs are pickable/seedable.
- Browser-engine file registration keys on the current file revision (stat), so an edited doc re-registers instead of serving a stale buffer.
- canRunInBrowser is query-aware: a FROM/JOIN path literal that isn't a bound doc routes to the server (only it auto-binds bare path literals).

+ tests for path-literal scoping, gzip override, and duplicate columns.
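A minimal sketch of the table-position scoping described above. The helper name and the regex are assumptions for illustration; they are not the PR's actual `FROM_JOIN_PATH_RE`, whose definition is not shown in this thread.

```typescript
// Hypothetical helper: collects drive paths that appear as a single-quoted
// literal directly after FROM or JOIN, and ignores value-position strings.
function findTablePathLiterals(query: string): string[] {
  const tablePathRe = /\b(?:FROM|JOIN)\s+'(\/[^']*)'/gi;
  const seen = new Set<string>();
  let match: RegExpExecArray | null;
  while ((match = tablePathRe.exec(query)) !== null) {
    const path = match[1];
    if (path !== undefined) seen.add(path); // dedupe repeated references
  }
  return Array.from(seen);
}
```

With this shape, `WHERE s.source = '/data/other.csv'` is left alone because no FROM or JOIN keyword precedes the literal. The scan still works on raw text, so it knows nothing about SQL comments.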
Addressed all 6 review findings (commit fc2fd19)
Added core tests for #3, #4, #6. Full suite 474 pass / 0 fail; live build clean.
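The .gz handling in this commit can be pictured as: strip an optional `.gz` suffix first, then look up the format from the remaining extension. The sketch below is an assumption about that shape; `detectFormat` and the extension table are hypothetical and stand in for the PR's `formatForPath`, whose source is not shown here.

```typescript
// Hypothetical extension table, built from the format list in this PR.
// Mapping jsonl to ndjson is an assumption.
const FORMAT_BY_EXT = new Map<string, string>([
  ["csv", "csv"], ["tsv", "tsv"], ["parquet", "parquet"], ["xlsx", "xlsx"],
  ["json", "json"], ["ndjson", "ndjson"], ["jsonl", "ndjson"],
  ["db", "sqlite"], ["sqlite", "sqlite"], ["sqlite3", "sqlite"],
  ["duckdb", "duckdb"],
]);

interface DetectedFormat {
  format: string;
  gzipped: boolean;
}

// Strip an optional .gz suffix, then resolve the format from what is left.
function detectFormat(path: string): DetectedFormat | null {
  const lower = path.toLowerCase();
  const gzipped = lower.endsWith(".gz");
  const base = gzipped ? lower.slice(0, -3) : lower;
  const dot = base.lastIndexOf(".");
  if (dot === -1) return null;
  const format = FORMAT_BY_EXT.get(base.slice(dot + 1));
  return format === undefined ? null : { format, gzipped };
}
```

A path such as `/raw/data.txt` resolves to `null` here, which is the case the `-t name=path:format` override exists for.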
💡 Codex Review
Reviewed commit: fc2fd19278
…SM engine

Second review pass flagged that the browser engine still had two issues the server fix didn't cover:
- rewritePathLiterals now only rewrites FROM/JOIN path literals (matching the server), so a value-position string like `WHERE source = '/data/sales.csv'` is no longer clobbered in the browser engine.
- The persistent WASM db kept views from earlier runs; drop views that aren't bound in the current run (removed docs, org/drive switch) before querying, so a stale table name can't be selected out of scope.

Verified in headless Chrome: a seeded `FROM '/qa/sales.csv'` query runs in the wasm engine (badge "wasm", 5 rows) with the new rewrite/view logic.
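The stale-view cleanup amounts to a set difference between the views that exist in the persistent database and the tables bound in the current run. A small sketch under that reading; the function name is hypothetical and the real code in duckdb.ts may obtain the view list differently.

```typescript
// Hypothetical helper: returns DROP statements for views that exist in the
// persistent database but are not bound in the current run.
function dropStaleViewStatements(existingViews: string[], boundTables: string[]): string[] {
  const bound = new Set(boundTables);
  return existingViews
    .filter((view) => !bound.has(view))
    // Double-quote the identifier and double any embedded quote.
    .map((view) => `DROP VIEW IF EXISTS "${view.replace(/"/g, '""')}"`);
}
```

Running these statements before the user query means a table name from an earlier org or drive can no longer resolve.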
💡 Codex Review
Reviewed commit: 93ffbe5d4f
Third review pass:
- duckdb.ts: the browser engine built rows keyed by Arrow field name, so duplicate output columns collided. Uniquify names and read cells by column index (getChildAt) so a query projecting two `id`s keeps both, matching the server-side fix.
- DatabasePreviewViewer / TablePreviewViewer: guard async loads with a monotonic request token so a slow response for a previously-selected file/table can't overwrite the current preview (switching files or db tables quickly).
Third review pass addressed (commit 1c18d27)
Verified the WASM happy path still runs in-browser after the row-building change. All prior threads resolved.
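For illustration, the duplicate-column fix can be sketched as two steps: make the names unique, then build each row by column position rather than by name. The `_1`, `_2` suffix scheme and both function names are assumptions; the PR does not state which scheme it uses.

```typescript
// Make duplicate column names unique by suffixing repeats: id, id_1, id_2.
// The suffix scheme is an assumption, not taken from the PR.
function uniquifyColumnNames(names: string[]): string[] {
  const used = new Set<string>();
  return names.map((name) => {
    let candidate = name;
    let n = 1;
    while (used.has(candidate)) {
      candidate = `${name}_${n}`;
      n += 1;
    }
    used.add(candidate);
    return candidate;
  });
}

// Build row objects by column index so no value is lost to a key collision.
function buildRows(names: string[], rows: unknown[][]): Record<string, unknown>[] {
  const unique = uniquifyColumnNames(names);
  return rows.map((cells) => {
    const row: Record<string, unknown> = {};
    unique.forEach((key, i) => {
      row[key] = cells[i];
    });
    return row;
  });
}
```

A query such as `SELECT a.id, b.id FROM ...` would then yield rows with both `id` and `id_1` instead of one key overwriting the other.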
💡 Codex Review
Reviewed commit: 1c18d27008
…t the query

Fourth review pass: the deadline was armed only for the query phase, leaving downloads, extension loads, the sqlite->parquet bridge, and table materialization unbounded. Arm the timer before any work so it covers the full operation: it interrupts the active DuckDB connection (extension load / materialization / query) and a checkTimeout() guard stops between phases (downloads and the bridge, where there's no connection to interrupt). The sqlite bridge also checks the deadline between tables.
Fourth review pass addressed (commit e0ef818): the query timeout is now armed before any setup work, so downloads, extension loads, the sqlite→parquet bridge, and materialization are all bounded, not just the final query. The active DuckDB connection is interrupted and phase boundaries check the deadline.
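A minimal sketch of the "arm first, check between phases" idea, assuming a deadline object created before any setup work. The `Deadline` class, its injectable clock, and `runPhases` are hypothetical; only the `checkTimeout()` name comes from the commit message.

```typescript
// Hypothetical deadline: constructed before any setup work so that every
// later phase is measured against the same expiry time.
class Deadline {
  private readonly expiresAt: number;
  private readonly now: () => number;

  constructor(timeoutMs: number, now: () => number = Date.now) {
    this.now = now;
    this.expiresAt = now() + timeoutMs;
  }

  // Guard for phase boundaries where there is no connection to interrupt.
  checkTimeout(phase: string): void {
    if (this.now() >= this.expiresAt) {
      throw new Error(`sql timed out before ${phase}`);
    }
  }
}

// Run named phases in order, checking the deadline before each one.
function runPhases(deadline: Deadline, phases: Array<[string, () => void]>): string[] {
  const completed: string[] = [];
  phases.forEach(([name, run]) => {
    deadline.checkTimeout(name);
    run();
    completed.push(name);
  });
  return completed;
}
```

The clock is injectable only to make the behavior easy to exercise; a check between phases cannot stop work that is already in flight, which is why the commit also interrupts the active connection.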
💡 Codex Review
Reviewed commit: e0ef8185e2
…results

Fifth review pass:
- duckdb.ts: ensureRegistered now also drops registered file buffers that aren't bound in the current run (removed docs, org/drive switch), not just stale views, so an out-of-scope buffer can't be read.
- SqlPage: the workbench run() guards its result/error/running updates with a monotonic run token, so a slow earlier run can't clobber a newer one (rapid re-runs or an edited query), matching the preview viewers.
Fifth review pass addressed (commit f122a91): WASM engine now drops stale registered file buffers (not just views) for docs no longer in scope, and the workbench run() guards its result updates with a run token so a slow earlier run can't clobber a newer one, matching the preview viewers.
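The run-token guard can be sketched as a counter that is bumped on every run and compared again when the result arrives. `LatestOnly` and `runGuarded` are hypothetical names; the actual SqlPage code is not shown in this thread and likely keeps the token in component state or a ref.

```typescript
// Hypothetical helper: hands out increasing tokens and reports whether a
// token still belongs to the most recent run.
class LatestOnly {
  private current = 0;

  begin(): number {
    this.current += 1;
    return this.current;
  }

  isCurrent(token: number): boolean {
    return token === this.current;
  }
}

// Apply a result only if no newer run started while this one was pending.
async function runGuarded<T>(
  guard: LatestOnly,
  work: () => Promise<T>,
  apply: (result: T) => void,
): Promise<void> {
  const token = guard.begin();
  const result = await work();
  if (guard.isCurrent(token)) apply(result);
}
```

The slow earlier run still finishes, but its result is dropped instead of overwriting the newer one.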
💡 Codex Review
Reviewed commit: f122a91a44
Sixth review pass: the sqlite->parquet bridge runs on its own connection, so the deadline could interrupt the main connection but not an in-flight bridge COPY. Register the bridge connection with the timer so a timeout interrupts it too (unregistered before close), completing timeout coverage across every phase.
Sixth review pass addressed (commit 486815b): the sqlite→parquet bridge connection is now registered with the query deadline, so a timeout interrupts an in-flight bridge COPY too, completing timeout coverage across download, extension load, bridge, materialization, and query.
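One way to picture "registering a connection with the timer" is a timer that keeps a set of interruptible targets and interrupts all of them when it fires. The sketch below assumes that shape; `Interruptible` is a stand-in interface and `DeadlineTimer` is a hypothetical name, not the PR's code.

```typescript
// Stand-in for anything the timer can interrupt, such as a database
// connection. This interface is an assumption for illustration.
interface Interruptible {
  interrupt(): void;
}

class DeadlineTimer {
  private readonly targets = new Set<Interruptible>();
  private handle: ReturnType<typeof setTimeout> | undefined;

  // Start the countdown; call once, before any setup work begins.
  arm(timeoutMs: number): void {
    this.handle = setTimeout(() => this.fire(), timeoutMs);
  }

  register(target: Interruptible): void {
    this.targets.add(target);
  }

  // Unregister before closing a connection so it is never interrupted late.
  unregister(target: Interruptible): void {
    this.targets.delete(target);
  }

  // Interrupt every target that is registered at the moment of expiry.
  fire(): void {
    this.targets.forEach((target) => target.interrupt());
  }

  cancel(): void {
    if (this.handle !== undefined) clearTimeout(this.handle);
  }
}
```

Under this design the main connection and the bridge connection are both registered, and the bridge is unregistered before it is closed, as the commit message describes.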
💡 Codex Review
Reviewed commit: 486815bc1b
Seventh review pass: downloads were the last phase the timeout couldn't interrupt — only checked between files. Thread an AbortSignal from the deadline timer into AgentS3Client.getObject so an in-flight download is aborted on timeout. Every phase (download, extension load, sqlite bridge, materialization, query) is now interruptible.
Seventh review pass addressed (commit 892ef08): S3 downloads now receive an AbortSignal from the query deadline, so an in-flight download is aborted on timeout. Every phase (download, extension load, sqlite bridge, materialization, and query) is now interruptible.
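As a rough sketch of threading an AbortSignal from a deadline into a download: the deadline owns an `AbortController`, exposes its signal, and aborts it on expiry. `AbortingDeadline` and `FakeTransfer` are hypothetical; the real `AgentS3Client.getObject` signature is not shown in this thread.

```typescript
// Hypothetical deadline that owns an AbortController and aborts on expiry.
class AbortingDeadline {
  private readonly controller = new AbortController();

  get signal(): AbortSignal {
    return this.controller.signal;
  }

  expire(): void {
    this.controller.abort();
  }
}

// Stand-in for an in-flight transfer that stops as soon as the signal fires.
class FakeTransfer {
  private aborted = false;

  constructor(signal: AbortSignal) {
    if (signal.aborted) this.aborted = true;
    signal.addEventListener("abort", () => { this.aborted = true; }, { once: true });
  }

  isAborted(): boolean {
    return this.aborted;
  }
}
```

A real client would typically hand the same signal to its HTTP layer, for example `fetch(url, { signal })`, so the request itself is cancelled rather than merely ignored.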
💡 Codex Review
Reviewed commit: 892ef08fd2
// (e.g. `WHERE source = '/data/sales.csv'`) is never matched here.
const seenPaths = new Set<string>();
let docIdx = 1;
for (const match of params.query.matchAll(FROM_JOIN_PATH_RE)) {
Ignore path literals inside SQL comments
Because this scans the raw SQL text before DuckDB parses it, `FROM '/path.csv'` inside a `--` or `/* ... */` comment is treated as a real document reference. If that commented example points at an existing large file, the request can fail the size limit or spend time downloading/materializing a file the actual query never uses (for example `-- FROM '/big.csv'\nSELECT 1`). The auto-bind pass should skip comments/string trivia or rely on parsed table references.
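One possible shape for the suggested fix is to blank out comments before running the FROM/JOIN scan, while copying quoted strings and identifiers verbatim. This is a sketch of that idea, not the change that landed; `stripSqlComments` is a hypothetical helper and it assumes block comments do not nest.

```typescript
// Hypothetical pre-pass: removes -- line comments and /* */ block comments,
// copying quoted strings and identifiers through untouched.
// Assumes block comments do not nest.
function stripSqlComments(sql: string): string {
  let out = "";
  let i = 0;
  while (i < sql.length) {
    const ch = sql.charAt(i);
    const next = sql.charAt(i + 1);
    if (ch === "'" || ch === '"') {
      // Find the closing quote; a doubled quote is an escaped quote.
      let j = i + 1;
      while (j < sql.length) {
        if (sql.charAt(j) === ch) {
          if (sql.charAt(j + 1) !== ch) break;
          j += 1;
        }
        j += 1;
      }
      out += sql.slice(i, j + 1);
      i = j + 1;
    } else if (ch === "-" && next === "-") {
      // Line comment: skip up to, but not including, the newline.
      while (i < sql.length && sql.charAt(i) !== "\n") i += 1;
    } else if (ch === "/" && next === "*") {
      // Block comment: replace with a single space so tokens stay separated.
      const end = sql.indexOf("*/", i + 2);
      i = end === -1 ? sql.length : end + 2;
      out += " ";
    } else {
      out += ch;
      i += 1;
    }
  }
  return out;
}
```

Scanning the stripped text means a commented-out `FROM '/big.csv'` no longer triggers a download, while a literal such as `'--not a comment'` is preserved. Relying on parsed table references, as the review also suggests, would avoid the hand-written scanner entirely.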
function starterForDatabase(table: string, tables: string[]): string {
  if (tables.length === 0) return `-- ${table} has no tables`
  const list = tables.map((t) => `${table}.${t}`).join(", ")
  return `-- tables: ${list}\nSELECT * FROM ${table}.${tables[0]} LIMIT 100`
Quote database table names in starter queries
When a pre-bound SQLite/DuckDB file's first table is not a plain identifier (e.g. `order details`, `2025-sales`, or a reserved word), this generated starter query is invalid even though the preview path handles those tables by quoting identifiers. Quote both the schema and discovered table name here so opening a database file seeds a runnable query for valid database table names.
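A sketch of what the suggestion could look like, assuming standard SQL identifier quoting (wrap in double quotes, double any embedded quote). `quoteIdent` is a hypothetical helper; the preview path's own quoting function is not shown in this thread.

```typescript
// Hypothetical helper: double-quote an identifier and escape embedded quotes.
function quoteIdent(name: string): string {
  return `"${name.replace(/"/g, '""')}"`
}

// Same starter query as in the diff, but with schema and table names quoted.
function starterForDatabase(table: string, tables: string[]): string {
  if (tables.length === 0) return `-- ${table} has no tables`
  const qualified = tables.map((t) => `${quoteIdent(table)}.${quoteIdent(t)}`)
  return `-- tables: ${qualified.join(", ")}\nSELECT * FROM ${qualified[0]} LIMIT 100`
}
```

Quoting unconditionally keeps the helper simple; plain identifiers remain valid when quoted, at the cost of a slightly noisier seeded query.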
Summary
Adds a sandboxed `sql` op that runs DuckDB SQL over documents stored in a drive, exposed everywhere the ops registry reaches: CLI (`agent-fs sql`), HTTP API (`POST /orgs/:orgId/ops`), MCP tool, plus a SQL workbench in the live UI powered by DuckDB-WASM.

Formats: csv, tsv, parquet, xlsx, json, ndjson/jsonl (text formats also `.gz`), sqlite (`.db`/`.sqlite`/`.sqlite3`, tables exposed as `<name>.<table>`), `.duckdb`.

Making a document SQL-able: quoted drive-path literals in the query (`FROM '/data/sales.csv'`), named `tables` bindings, or a format override for non-standard extensions (`-t logs=/raw/data.txt:csv`). Docs: docs/sql.md.

Security model
The hosted API is multitenant, so user SQL is untrusted. Probing showed DuckDB's `allowed_directories` sandbox is bypassed by the sqlite extension (it performs its own file I/O; a file outside the sandbox was readable post-lockdown via `ATTACH`). The op therefore materializes all inputs into in-memory tables during setup (sqlite via an isolated bridge instance converted to parquet), then disables the local filesystem, cuts external access, and locks configuration before any user SQL runs.

Covered by tests: `/etc/passwd` reads, http(s) fetches, `ATTACH`, `sqlite_scan`, `COPY ... TO`, and `SET` all fail with `VALIDATION_ERROR`.

Limits (env-tunable): `AGENT_FS_SQL_TIMEOUT_MS` (30s, interrupt-based), `AGENT_FS_SQL_MAX_FILE_BYTES` (256MB), `AGENT_FS_SQL_MEMORY_LIMIT` (512MB), `maxRows` 1000 default / 10000 max.

Changes
- `sql` op (`packages/core/src/ops/sql.ts`), registry + RBAC (viewer), data-file MIME types (parquet/tsv/sqlite/ndjson), `@duckdb/node-api` pinned exact (1.5.3-r.3) given the 2025 npm supply-chain incident on DuckDB packages
- `agent-fs sql [query] -t name=path[:format] --max-rows N`, stdin queries, aligned-table formatter; dep marked `--external` in the npm bundle (per-platform binaries resolve via DuckDB's own optionalDependencies)
- (`docs/openapi.json` regenerated)
- `/sql/~/:orgId/:driveId`: document picker, Monaco editor (Cmd+Enter), results table + hand-rolled SVG bar/line charts, CSV/JSON export, "Query" button on queryable files. Dual engine: DuckDB-WASM 1.32.0 in-browser for csv/tsv/parquet/json/ndjson (lazy chunk; bytes fetched via the authenticated API to avoid S3 CORS), automatic server fallback for xlsx/sqlite/duckdb (excel ext is broken in wasm, sqlite_scanner flaky)
- `docs/sql.md`, landing docs registration + `llms.txt`, SKILL.md command table/workflow/triggers, README bullet; versions synced to 0.8.2

Testing
- `bun test packages/core`
- `scripts/e2e.ts` exercising CLI + raw HTTP API + MCP against Docker MinIO with real parquet/xlsx/sqlite fixtures; suite passes 96/96
- `bun run typecheck` clean; full `bun run test` 415 pass / 0 fail; npm bundle builds; `landing/` and `live/` build under pnpm
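The env-tunable limits listed under the security model could be read along these lines. This is a sketch under stated assumptions: the variable names and defaults come from the summary, but the parsing rules, the 1024-based byte count for 256MB, the string form of the memory limit, and both function names are hypothetical.

```typescript
type Env = Record<string, string | undefined>;

// Parse a positive integer, falling back to the default on bad input.
function positiveInt(raw: string | undefined, fallback: number): number {
  const n = Number(raw);
  return raw !== undefined && Number.isInteger(n) && n > 0 ? n : fallback;
}

// Defaults mirror the summary: 30s timeout, 256MB per file, 512MB memory.
// Treating 256MB as 256 * 1024 * 1024 bytes is an assumption.
function readSqlLimits(env: Env) {
  return {
    timeoutMs: positiveInt(env["AGENT_FS_SQL_TIMEOUT_MS"], 30000),
    maxFileBytes: positiveInt(env["AGENT_FS_SQL_MAX_FILE_BYTES"], 256 * 1024 * 1024),
    memoryLimit: env["AGENT_FS_SQL_MEMORY_LIMIT"] ?? "512MB",
  };
}

// maxRows: 1000 by default, capped at 10000.
function clampMaxRows(requested: number | undefined): number {
  if (requested === undefined || !Number.isInteger(requested) || requested < 1) {
    return 1000;
  }
  return Math.min(requested, 10000);
}
```

Taking the environment as a parameter rather than reading a global keeps the sketch independent of the runtime.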