Support writing an in-memory DataFrame to ClickHouse via to_clickhouse #592
Open
wudidapaopao wants to merge 2 commits into
… via to_clickhouse

A DataStore wrapping a pure in-memory pandas DataFrame (source_type == "dataframe") was rejected by to_clickhouse() even with an explicit target, because both the _require_clickhouse_for_writeback guard and _get_adapter() key off the *source* type, not the target. The DataFrame-upload machinery itself (local _execute() -> DESCRIBE Python(df) -> INSERT ... FROM Python(df)) never needed a ClickHouse source, so this was a guard/adapter coupling, not a technical limit.

to_clickhouse() now detects a pure-DataFrame source and, when an explicit host is given, routes through the existing DataFrame-upload path with the target treated as ClickHouse (via a scoped source_type override in _target_server_context that is restored even on exception). An explicit, non-empty host is required for a df source (no source server to default to); omitting it raises a clear DataStoreError.

Scope is intentionally limited to to_clickhouse:
- create_view / create_materialized_view / save() still reject a pure df (they need a server-side SQL source a local df cannot provide);
- non-ClickHouse sources (mysql/postgresql) remain rejected by _require_clickhouse_for_writeback.

Adds TestToClickHouseLocalDataFrame: fail/replace/append with exact-value and row-order assertions, engine/order_by(list)/partition_by, index/index_label, a Pandas transform, the no-host/empty-host guard, source_type restore, and scope-boundary tests proving create_view / create_materialized_view (with an explicit host) and save(view/MV) still reject a local df. Plus a cross-server case asserting the table lands on the target server only.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
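The scoped source_type override described in this commit can be illustrated with a small context manager. This is only a sketch of the pattern, not the PR's code: FakeStore and scoped_source_type are hypothetical names, and the real override lives inside _target_server_context, whose body is not shown here.

```python
from contextlib import contextmanager


class FakeStore:
    """Hypothetical stand-in for DataStore; holds only the attribute the sketch needs."""

    def __init__(self, source_type):
        self.source_type = source_type


@contextmanager
def scoped_source_type(store, temporary):
    """Temporarily present `store` as `temporary`, restoring the original on exit."""
    original = store.source_type
    store.source_type = temporary
    try:
        yield store
    finally:
        # Runs on normal exit and also when the body raises.
        store.source_type = original


store = FakeStore("dataframe")

# Normal path: the override is visible only inside the block.
with scoped_source_type(store, "clickhouse"):
    inside = store.source_type

# Failure path: the original value is still restored.
try:
    with scoped_source_type(store, "clickhouse"):
        raise RuntimeError("simulated upload failure")
except RuntimeError:
    pass

after_failure = store.source_type
```

Putting the restore in a finally clause is what gives the "restored even on exception" behaviour: the store never stays mislabelled as a ClickHouse source after a failed upload.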
@chibugai, review this PR.
Reviewed (OCR opus-4-8, What I checked on the two files (
Test coverage is thorough (fail/replace/append, engine+partition, index_label, transform-flow-through, cross-server isolation). LGTM from my side.
…ickhouse
The local-DataFrame relaxation only recognized source_type=='dataframe' (the
DataStore(df) constructor). DataStore.from_df / from_dataframe wrap a frame
differently — they keep the default source_type ('chdb') and stash the frame
in _source_df — so to_clickhouse(host=...) on them hit the
_require_clickhouse_for_writeback guard ("source is 'chdb'") even though the
upload path works for them identically.
Broaden the detection to "wraps a local frame AND has no source host":

    wraps_local_df = source_type=='dataframe' or _source_df is not None
    writing_local_df = wraps_local_df and not _remote_params.get('host')

The "no source host" gate is essential: a pandas-only op (e.g. transform) on a
ClickHouse-sourced store also caches into _source_df, but it keeps its source
host and must keep defaulting its writeback target to that source server rather
than erroring on a missing host. Without the gate, the existing
TestToClickHouseDataFrameUpload cases (CH source + transform, no explicit host)
regress.
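
The effect of the "no source host" gate can be checked with a standalone version of the predicate. This is a sketch under stated assumptions: is_writing_local_df is a hypothetical free function mirroring the two lines above, the frame is replaced by a placeholder object, and the host string is made up.

```python
def is_writing_local_df(source_type, source_df, remote_params):
    """Mirror of the detection above, lifted out of DataStore for illustration."""
    wraps_local_df = source_type == "dataframe" or source_df is not None
    return wraps_local_df and not remote_params.get("host")


frame = object()  # placeholder for a pandas DataFrame; only its presence matters

# DataStore(df): source_type is 'dataframe' and there is no source host.
from_constructor = is_writing_local_df("dataframe", frame, {})

# DataStore.from_df(df): default source_type, frame stashed in _source_df.
from_factory = is_writing_local_df("chdb", frame, {})

# ClickHouse source after a pandas-only op: frame cached, but a source host exists.
# The host value here is an invented example.
ch_with_transform = is_writing_local_df("clickhouse", frame, {"host": "ch-source:9000"})
```

Only the first two cases count as local-DataFrame writes; the third keeps its source host, so its writeback target still defaults to the source server.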
Adds tests: from_df / from_dataframe -> to_clickhouse write with exact value &
order checks + source_type restore, and the no-host guard on the from_df
representation. Full writeback suite: 75 passed.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
What
DataStore.to_clickhouse() now accepts a pure in-memory pandas DataFrame (source_type == "dataframe") as the source when an explicit target host is given. Previously this was rejected even with a host.

Why
The DataFrame-upload path (_execute() locally → DESCRIBE Python(df) → INSERT … FROM Python(df)) never needed a ClickHouse source; it already runs for ClickHouse-sourced stores whose pipeline has Pandas-only ops. A pure-DataFrame store was blocked only by two couplings keyed on the source type rather than the target:
- the _require_clickhouse_for_writeback guard (rejects source_type != clickhouse), and
- _get_adapter() (no adapter for "dataframe").

The host= parameter sets where to write but didn't change either decision, so an explicit ClickHouse target was ignored. This is a guard/adapter coupling, not a technical limit.

How
to_clickhouse() detects a pure-DataFrame source and, when an explicit host is given, runs the existing DataFrame-upload path with the target treated as ClickHouse, via a scoped source_type override in _target_server_context that is restored even on exception. An explicit, non-empty host is required (a df has no source server to default to); omitting it raises a clear DataStoreError.

Scope is deliberately narrow:
- create_view / create_materialized_view / save() still reject a pure df: they need a server-side SQL source a local df cannot provide (a VIEW stores a query, an MV is an insert-trigger on a server-side table; Python(df) is local-only and can't be persisted into a server-side definition).
- non-ClickHouse sources (mysql/postgresql) remain rejected by _require_clickhouse_for_writeback.

Tests
New TestToClickHouseLocalDataFrame (+ a cross-server case), all asserting exact values and row order per the repo's testing principles:
- fail / replace (new + refuse-existing) / append, with value + order checks
- engine / order_by (list) / partition_by
- index=True + index_label
- transform on a local df
- missing or empty host raises DataStoreError
- source_type restored after success and after a raised exception
- create_view / create_materialized_view (with explicit host) and save(view/MV) still reject a local df

Verified locally against the .github/ci/clickhouse-pair.yml pair: full test_writeback.py = 73 passed (60 existing + 13 new), no regressions.

🤖 Generated with Claude Code
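
The explicit-host requirement from the How section amounts to a small guard. The sketch below is an assumption about its shape, not the PR's implementation: resolve_target_host is a hypothetical helper, DataStoreError is redefined locally as a stand-in for the project's exception, and the host strings are made up.

```python
class DataStoreError(Exception):
    """Local stand-in for the project's DataStoreError."""


def resolve_target_host(host, writing_local_df, source_host=None):
    """Pick the write target; a local DataFrame has no source server to fall back on."""
    if host:
        return host
    if writing_local_df:
        # None and "" are both treated as "no host given".
        raise DataStoreError(
            "to_clickhouse() on a local DataFrame requires an explicit, non-empty host"
        )
    # Server-backed sources keep defaulting to their own server.
    return source_host


explicit = resolve_target_host("ch-target:9000", writing_local_df=True)
defaulted = resolve_target_host(None, writing_local_df=False, source_host="ch-source:9000")

rejected = []
for missing in (None, ""):
    try:
        resolve_target_host(missing, writing_local_df=True)
    except DataStoreError:
        rejected.append(missing)
```

Raising early, before any upload work starts, keeps the failure independent of server state, which is consistent with the no-host and empty-host test cases listed above.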