Support writing an in-memory DataFrame to ClickHouse via to_clickhouse #592

Open
wudidapaopao wants to merge 2 commits into chdb-io:main from wudidapaopao:support_local_dataframe_to_clickhouse

@wudidapaopao

Contributor

What

DataStore.to_clickhouse() now accepts a pure in-memory pandas DataFrame (source_type == "dataframe") as the source, when an explicit target host is given. Previously this was rejected even with a host.

import pandas as pd
from datastore import DataStore

df = pd.DataFrame({"city": ["Beijing", "Shanghai"], "amount": [100.0, 300.0]})

DataStore(df).to_clickhouse(
    "analytics.sales",
    host="myorg.region.aws.clickhouse.cloud",   # *.clickhouse.cloud / :9440 auto-enables TLS
    user="default", password="...",
    order_by="city",
)

Why

The DataFrame-upload path (_execute() locally → DESCRIBE Python(df) → INSERT … FROM Python(df)) never needed a ClickHouse source; it already runs for ClickHouse-sourced stores whose pipeline has Pandas-only ops. A pure-DataFrame store was blocked only by two couplings keyed on the source type rather than the target:

  1. the _require_clickhouse_for_writeback guard (rejects source_type != clickhouse), and
  2. _get_adapter() (no adapter for "dataframe").

The host= parameter sets where to write but didn't change either decision, so an explicit ClickHouse target was ignored. This is a guard/adapter coupling, not a technical limit.

How

to_clickhouse() detects a pure-DataFrame source and, when an explicit host is given, runs the existing DataFrame-upload path with the target treated as ClickHouse — via a scoped source_type override in _target_server_context that is restored even on exception. An explicit, non-empty host is required (a df has no source server to default to); omitting it raises a clear DataStoreError.
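The scoped override described above can be sketched as a context manager with a try/finally. This is a minimal illustration, not the code in datastore/core.py: FakeStore, the function signature, and the "example-host" value are hypothetical stand-ins, and only the attribute names source_type and _remote_params and the as_clickhouse_target flag come from this PR's discussion.

```python
from contextlib import contextmanager


class FakeStore:
    """Hypothetical stand-in for a DataStore; holds only the two fields the override touches."""

    def __init__(self, source_type, remote_params=None):
        self.source_type = source_type
        self._remote_params = dict(remote_params or {})


@contextmanager
def target_server_context(store, as_clickhouse_target=False, **target_params):
    """Temporarily point `store` at a target server and restore its state on exit."""
    saved_params = store._remote_params
    saved_type = store.source_type
    store._remote_params = {**saved_params, **target_params}
    if as_clickhouse_target:
        store.source_type = "clickhouse"
    try:
        yield store
    finally:
        # Runs on normal exit and when the body raises.
        store._remote_params = saved_params
        store.source_type = saved_type


store = FakeStore("dataframe")
try:
    with target_server_context(store, as_clickhouse_target=True, host="example-host"):
        inside_type = store.source_type  # "clickhouse" while the override is active
        raise RuntimeError("simulated upload failure")
except RuntimeError:
    pass
```

After the simulated failure, store.source_type is back to "dataframe" and store._remote_params is empty again, which is the property the restore tests in this PR check.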

Scope is deliberately narrow:

  • create_view / create_materialized_view / save() still reject a pure df — they need a server-side SQL source a local df cannot provide (a VIEW stores a query, an MV is an insert-trigger on a server-side table; Python(df) is local-only and can't be persisted into a server-side definition).
  • non-ClickHouse sources (mysql/postgresql) remain rejected by _require_clickhouse_for_writeback.

Tests

New TestToClickHouseLocalDataFrame (+ a cross-server case), all asserting exact values and row order per the repo's testing principles:

  • fail / replace (new + refuse-existing) / append, with value + order checks
  • engine / order_by (list) / partition_by
  • index=True + index_label
  • a Pandas transform on a local df
  • no-host and empty-host guard → DataStoreError
  • source_type restored after success and after a raised exception
  • scope-boundary: create_view / create_materialized_view (with explicit host) and save(view/MV) still reject a local df
  • cross-server: the table lands on the target server only (not the source)

Verified locally against the .github/ci/clickhouse-pair.yml pair: full test_writeback.py = 73 passed (60 existing + 13 new), no regressions.

🤖 Generated with Claude Code

… via to_clickhouse

A DataStore wrapping a pure in-memory pandas DataFrame (source_type ==
"dataframe") was rejected by to_clickhouse() even with an explicit target,
because both the _require_clickhouse_for_writeback guard and _get_adapter()
key off the *source* type, not the target. The DataFrame-upload machinery
itself (local _execute() -> DESCRIBE Python(df) -> INSERT ... FROM Python(df))
never needed a ClickHouse source, so this was a guard/adapter coupling, not a
technical limit.

to_clickhouse() now detects a pure-DataFrame source and, when an explicit
host is given, routes through the existing DataFrame-upload path with the
target treated as ClickHouse (via a scoped source_type override in
_target_server_context that is restored even on exception). An explicit,
non-empty host is required for a df source (no source server to default to);
omitting it raises a clear DataStoreError.

Scope is intentionally limited to to_clickhouse:
- create_view / create_materialized_view / save() still reject a pure df
  (they need a server-side SQL source a local df cannot provide);
- non-ClickHouse sources (mysql/postgresql) remain rejected by
  _require_clickhouse_for_writeback.

Adds TestToClickHouseLocalDataFrame: fail/replace/append with exact-value and
row-order assertions, engine/order_by(list)/partition_by, index/index_label,
a Pandas transform, the no-host/empty-host guard, source_type restore, and
scope-boundary tests proving create_view / create_materialized_view (with an
explicit host) and save(view/MV) still reject a local df. Plus a cross-server
case asserting the table lands on the target server only.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@wudidapaopao

Contributor Author

@chibugai, review this PR.

@wudidapaopao changed the title from "feat(writeback): support writing an in-memory DataFrame to ClickHouse via to_clickhouse" to "Support writing an in-memory DataFrame to ClickHouse via to_clickhouse" on Jun 19, 2026
@chibugai


Reviewed (OCR opus-4-8, --max-tools 40, + manual read of the diff). No blocking correctness or safety issues found — advisory only, not approving/requesting changes.

What I checked on the two files (datastore/core.py, datastore/tests/test_writeback.py):

  • Exception safety of the temporary override: _target_server_context now also swaps source_type to "clickhouse" when as_clickhouse_target=True, and restores both _remote_params and source_type in the finally. test_replace_new_creates_then_refuses_existing and test_source_type_restored_and_store_still_usable confirm the override doesn't leak after a raise. Good.
  • Scope containment — the relaxation is to_clickhouse-only. create_view / create_materialized_view / save(type=view|materialized_view) on a pure-df store stay rejected even with an explicit host, and _require_clickhouse_for_writeback still blocks mysql/postgresql sources. The three scope-guard tests lock this in.
  • No new injection / type-mapping surface — uploads route through the existing Python(df) table-function path; no identifier/value handling is changed by this PR.
  • not host vs is None — intentional, so a blank host string gets the clear DataStoreError up front rather than a generic failure deeper in the upload path; test_requires_explicit_host exercises both.
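To illustrate the last point, here is a small sketch of why a truthiness check catches more than an identity check. The function name and error message are hypothetical, and DataStoreError is redefined locally so the snippet runs on its own; it is not the library's guard.

```python
class DataStoreError(Exception):
    """Local stand-in for the library's error type."""


def require_explicit_host(host):
    # `not host` is true for both None and "", so a blank string fails here
    # instead of reaching the upload path. `host is None` would let "" through.
    if not host:
        raise DataStoreError("a local DataFrame source needs an explicit, non-empty host")
    return host


def rejects(value):
    """Return True if the guard raises DataStoreError for `value`."""
    try:
        require_explicit_host(value)
    except DataStoreError:
        return True
    return False
```

With this sketch, rejects(None) and rejects("") are both true, while a non-empty host string passes through unchanged.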

Test coverage is thorough (fail/replace/append, engine+partition, index_label, transform-flow-through, cross-server isolation). LGTM from my side.

…ickhouse

The local-DataFrame relaxation only recognized source_type=='dataframe' (the
DataStore(df) constructor). DataStore.from_df / from_dataframe wrap a frame
differently — they keep the default source_type ('chdb') and stash the frame
in _source_df — so to_clickhouse(host=...) on them hit the
_require_clickhouse_for_writeback guard ("source is 'chdb'") even though the
upload path works for them identically.

Broaden the detection to "wraps a local frame AND has no source host":
  wraps_local_df = source_type=='dataframe' or _source_df is not None
  writing_local_df = wraps_local_df and not _remote_params.get('host')

The "no source host" gate is essential: a pandas-only op (e.g. transform) on a
ClickHouse-sourced store also caches into _source_df, but it keeps its source
host and must keep defaulting its writeback target to that source server rather
than erroring on a missing host. Without the gate, the existing
TestToClickHouseDataFrameUpload cases (CH source + transform, no explicit host)
regress.
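The broadened detection and its gate can be sketched as a pure function. The function name and the plain arguments are assumptions made for illustration (the commit describes attributes on the store itself), and the host value is a placeholder.

```python
def is_writing_local_df(source_type, source_df, remote_params):
    """Return True when the store wraps a local frame and has no source host."""
    wraps_local_df = source_type == "dataframe" or source_df is not None
    # The gate: a store that kept its source host is not a pure local write,
    # even if a pandas-only op cached a frame into _source_df.
    return wraps_local_df and not remote_params.get("host")


frame = object()  # placeholder for a pandas DataFrame

constructor_case = is_writing_local_df("dataframe", None, {})  # DataStore(df)
from_df_case = is_writing_local_df("chdb", frame, {})  # from_df / from_dataframe
ch_transform_case = is_writing_local_df("clickhouse", frame, {"host": "source-host"})
```

Under these assumptions the first two cases are detected as local-DataFrame writes and the third is not, so a ClickHouse-sourced store with a cached frame keeps defaulting its target to its own source server.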

Adds tests: from_df / from_dataframe -> to_clickhouse write with exact value &
order checks + source_type restore, and the no-host guard on the from_df
representation. Full writeback suite: 75 passed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@wudidapaopao requested a review from Copilot and then removed the request on June 20, 2026, 06:35