This document describes the core architecture, design principles, and development philosophy of DataStore. It is intended for contributors and developers who want to understand the internals.
Users write familiar pandas-style code, and the backend automatically selects the optimal execution engine.
DataStore bridges the gap between pandas' intuitive API and SQL's powerful query optimization. Users get the best of both worlds without changing their coding habits.
The diagram above shows the four-layer architecture:
- User API: pandas-like interface that returns lazy objects
- Lazy Operation Chain: Operations are recorded, not executed
- Execution Trigger & Planning: Natural triggers invoke QueryPlanner to create execution segments
- Segmented Execution: Each segment routes to optimal engine (chDB or Pandas), with intermediate results cached
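
The planning layer can be pictured with a small sketch. This is not DataStore's QueryPlanner: `Op` and `plan_segments` are hypothetical names, and the engine labels are assigned by hand here rather than derived from any real configuration. It only illustrates the idea of cutting a recorded operation chain into segments whenever the target engine changes.

```python
from dataclasses import dataclass
from itertools import groupby


@dataclass(frozen=True)
class Op:
    name: str    # recorded operation, e.g. "filter"
    engine: str  # hypothetical label: "chdb" or "pandas"


def plan_segments(ops):
    """Group consecutive operations that target the same engine."""
    return [
        (engine, [op.name for op in run])
        for engine, run in groupby(ops, key=lambda op: op.engine)
    ]


chain = [
    Op("filter", "chdb"),
    Op("groupby", "chdb"),
    Op("add_prefix", "pandas"),
    Op("filter", "chdb"),
]
segments = plan_segments(chain)
print(segments)
```

Each tuple is one segment; in this toy model the output of a segment would be cached and handed to the next one.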
We deeply respect the pandas ecosystem and the expertise data scientists have built over years. Pandas has established itself as the de facto standard for data manipulation in Python, and millions of developers are fluent in its API.
Our goal is NOT to replace pandas, but to provide a performance-optimized alternative that honors pandas conventions. When a user writes df.groupby('x').mean(), they should get the same result whether using pandas or DataStore—just faster for large datasets.
Traditional pandas executes operations eagerly—every operation immediately processes data. DataStore takes a different approach:
# Pandas: Each line executes immediately
df = pd.read_csv("huge.csv") # Load entire file
df = df[df['age'] > 25] # Filter (creates new DataFrame)
df = df.groupby('city')['salary'] # Prepare groupby
result = df.mean() # Execute groupby
# DataStore: Operations are lazy until needed
ds = DataStore.from_file("huge.csv") # Just records the source
ds = ds.filter(ds.age > 25) # Records filter condition
ds = ds.groupby('city')['salary'] # Records groupby
result = ds.mean() # Still lazy!
print(result) # NOW executes - as optimized SQL:
# SELECT city, avg(salary) FROM file
# WHERE age > 25 GROUP BY city

Why this matters:
- Cross-row operations (filter, groupby, aggregations) compile to SQL
- chDB/ClickHouse's columnar engine optimizes these operations
- Only the final result is materialized, not intermediate DataFrames
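
As a rough illustration of how a recorded chain can turn into a single SQL statement, here is a toy model. `LazyQuery` and its methods are hypothetical and far simpler than DataStore's real classes; conditions are plain strings, and `to_sql` assumes a groupby and an aggregate have both been recorded.

```python
class LazyQuery:
    """Toy lazy object: every method records state and runs nothing."""

    def __init__(self, source, where=None, group_by=None, agg=None):
        self.source = source
        self.where = where
        self.group_by = group_by
        self.agg = agg  # (sql_function, column)

    def filter(self, condition):
        return LazyQuery(self.source, condition, self.group_by, self.agg)

    def groupby(self, column):
        return LazyQuery(self.source, self.where, column, self.agg)

    def mean(self, column):
        return LazyQuery(self.source, self.where, self.group_by, ("avg", column))

    def to_sql(self):
        # Only here is the recorded chain turned into one statement.
        func, column = self.agg
        parts = [f"SELECT {self.group_by}, {func}({column})", f"FROM {self.source}"]
        if self.where:
            parts.append(f"WHERE {self.where}")
        parts.append(f"GROUP BY {self.group_by}")
        return " ".join(parts)


query = LazyQuery("file").filter("age > 25").groupby("city").mean("salary")
sql = query.to_sql()
print(sql)  # SELECT city, avg(salary) FROM file WHERE age > 25 GROUP BY city
```

Because no intermediate result exists until `to_sql` is called, the whole chain is visible to the SQL engine at once.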
Data exploration is iterative. You query, inspect, adjust, repeat. DataStore caches intermediate results:
ds = DataStore.from_file("data.csv")
# First query - executes SQL
ds.filter(ds.age > 25).head() # SQL executed, result cached
# Same filter - uses cache
ds.filter(ds.age > 25).describe() # No SQL, uses cached result
ds.filter(ds.age > 25)['salary'].mean() # No SQL, uses cached result

We do NOT guarantee 100% pandas syntax compatibility. That's not our goal and would be impractical.
Instead, we take a pragmatic approach:
- Extensive Testing: We run compatibility tests using import datastore as pd against real-world pandas code (including Kaggle notebooks) to identify gaps.
- Cover Common Patterns: We prioritize implementing the pandas operations that appear most frequently in data analysis workflows.
- Minimal Migration: Our goal is that users can migrate existing code with minimal changes, ideally just changing the import statement.
- Document Differences: When behavior differs from pandas, we document it clearly.
# Ideal migration path
- import pandas as pd
+ import datastore as pd
# Most code should work unchanged
df = pd.read_csv("data.csv")
df = df[df['value'] > 0]
result = df.groupby('category').agg({'value': 'sum'})

Not everyone loves pandas syntax. DataStore offers two equivalent approaches:
Pandas-style (familiar to pandas users):
import datastore as pd
df = pd.read_csv("data.csv")
result = df[df['age'] > 25].groupby('city')['salary'].mean()

Fluent SQL-style (explicit and Pythonic):
from datastore import DataStore
ds = DataStore.from_file("data.csv")
result = (ds
    .filter(ds.age > 25)
    .select('city', 'salary')
    .groupby('city')
    .agg({'salary': 'mean'})
    .orderby('salary', ascending=False)
    .to_df())

Both styles should compile to the same optimized SQL - API style must not determine the execution engine. Choose based on your preference:
- Pandas-style: Minimal migration, familiar syntax
- Fluent-style: Explicit intent, better readability for complex queries, IDE auto-completion
Note: Full SQL compilation for pandas-style groupby is still in development. This is a known limitation we're actively working to resolve.
ClickHouse is a modern columnar database optimized for analytical queries:
- Columnar Storage: Only reads columns you need
- Vectorized Execution: Processes data in batches
- Compression: Efficient storage and I/O
- SQL Optimization: Query planner optimizes your operations
- 100+ File Formats: Parquet, CSV, JSON, ORC, Avro, and 80+ more
- 20+ Data Sources: S3, MySQL, PostgreSQL, MongoDB, Iceberg, Delta Lake, etc.
- 334 Built-in Functions: String, datetime, geo, URL, IP, JSON, array, and more
By compiling pandas operations to ClickHouse SQL, DataStore gives you these optimizations for free.
A key capability that makes DataStore fast is chDB's zero-copy data exchange with pandas:
pandas DataFrame ←────── native ──────→ chDB (ClickHouse)
(zero-copy)
When you call ds.to_df(), chDB returns data directly as pandas DataFrame without copying memory. Similarly, when using DataStore.from_df(df), the DataFrame is passed to chDB's Python() table function which can query it directly without serialization overhead.
This is critical for exploratory workflows where you frequently switch between pandas operations and SQL queries:
ds = DataStore.from_file("data.csv")
df = ds.filter(ds.age > 25).to_df() # Zero-copy: chDB → pandas
df['new_col'] = df['a'] * 2 # Pure pandas operation
ds2 = DataStore.from_df(df) # Zero-copy: pandas → chDB
result = ds2.filter(ds2.new_col > 100).to_df() # Back to chDB

| Aspect | DataStore | Polars | DuckDB |
|---|---|---|---|
| Goal | pandas API + ClickHouse power | New high-perf API | SQL-first analytics |
| Learning Curve | Low (know pandas? you're set) | Medium (new API + LazyFrame) | Medium (SQL-first) |
| Ecosystem | ClickHouse (100+ formats, 20+ sources) | ~10 formats, ~5 sources | ~15 formats, ~10 sources |
| Zero-Copy pandas | ✅ pandas ↔ chDB (native) | ❌ (copy required) | ✅ via Arrow |
| Lazy Execution | ✅ Automatic | ✅ Automatic | |
| SQL Support | ✅ Full ClickHouse SQL | ✅ Full | |
| Unique Strength | pandas comfort + ClickHouse scale | Rust-based, memory efficient | Embedded SQL engine |
Every method that returns DataFrame or Series should return a Lazy object.
# All these return lazy objects - no execution happens yet
ds['column'] # → ColumnExpr (lazy)
ds['column'].str.upper() # → ColumnExpr (lazy)
ds['column'].mean() # → LazyAggregate (lazy)
ds['column'].head(5) # → LazySeries (lazy)
ds.groupby('x').size() # → LazySeries (lazy)
ds['a'] > 5 # → ColumnExpr wrapping Condition (lazy)

Why?
- Defers execution until results are truly needed
- Preserves the ability to choose the optimal execution engine at execution time
- Enables query optimization by analyzing the full operation chain
- At execution time, the config system determines which ExecutionEngine to use (pandas or chDB)
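
The value of deferring the engine choice can be shown with a small sketch. `Engine`, `ToyConfig`, and `LazyCall` are hypothetical stand-ins, not DataStore's config system; the point is only that a lazy object built before a configuration change still honors that change, because the engine is looked up when execution is requested.

```python
from enum import Enum


class Engine(Enum):  # hypothetical stand-in for ExecutionEngine
    CHDB = "chdb"
    PANDAS = "pandas"


class ToyConfig:
    def __init__(self):
        self.default_engine = Engine.CHDB
        self.overrides = {}  # function name -> Engine

    def engine_for(self, function_name):
        return self.overrides.get(function_name, self.default_engine)


class LazyCall:
    """Records a function name; the engine is resolved only on execution."""

    def __init__(self, function_name, config):
        self.function_name = function_name
        self.config = config

    def resolve_engine(self):
        return self.config.engine_for(self.function_name)


config = ToyConfig()
call = LazyCall("strftime", config)           # built before any override exists
config.overrides["strftime"] = Engine.PANDAS  # configuration changes later
print(call.resolve_engine())                  # Engine.PANDAS
```

An eager design would have had to commit to an engine when `call` was created.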
Explicitly calling _execute() in code and tests is forbidden.
Execution should be triggered through natural interactions:
| Trigger | Description |
|---|---|
| .values | Access underlying numpy array |
| .index | Access index |
| repr() / __repr__ | Display in notebook/REPL |
| __iter__ | Iteration |
| len() | Get length |
| print() | Print output |
| to_df() / to_pandas() | Explicit conversion |
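
One way such triggers could be wired up is through Python's special methods. The sketch below is an assumption about the general technique, not DataStore's implementation; `LazyValues` and its `executions` counter are hypothetical, and a plain list stands in for a numpy array.

```python
class LazyValues:
    """Toy lazy container: `compute` runs once, on the first natural trigger."""

    def __init__(self, compute):
        self._compute = compute
        self._result = None
        self.executions = 0

    def _materialize(self):  # internal; user code never calls this directly
        if self._result is None:
            self._result = list(self._compute())
            self.executions += 1
        return self._result

    @property
    def values(self):
        return self._materialize()

    def __len__(self):
        return len(self._materialize())

    def __iter__(self):
        return iter(self._materialize())

    def __repr__(self):
        return repr(self._materialize())


lazy = LazyValues(lambda: (x * 2 for x in range(3)))
assert lazy.executions == 0  # nothing has run yet
print(lazy)                  # print() falls back to __repr__, which executes
```

After the first trigger the result is reused, so later calls to `len()`, iteration, or `.values` do not execute again.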
# ✅ Good: Natural triggers
result = ds['age'].mean()
print(result) # Triggers execution via __repr__
np.testing.assert_array_equal(result.values, expected) # Triggers via .values
# ❌ Bad: Explicit execution
result = ds['age'].mean()
result._execute() # FORBIDDEN

Testing Convention:
# ✅ Use numpy testing with natural triggers
np.testing.assert_array_equal(result.values, expected.values)
# ❌ Avoid pandas testing that may not recognize lazy objects
pd.testing.assert_series_equal(result, expected) # May fail with lazy objects

Backward compatibility is NOT a priority. Architectural elegance IS.
- First priority: Clean, simple, elegant architecture
- Avoid duplicate definitions (e.g., value_counts should have ONE implementation)
- Single responsibility for each class
- Clear naming conventions (e.g., LazySeries not LazySeriesMethod)
Example: Unified Naming
# ✅ Good: Clear, unified naming
class LazySeries:
    """Wraps any Series method call for lazy evaluation."""
    pass
# ❌ Bad: Confusing, redundant naming
class LazySeriesMethod: ...  # Redundant "Method"
class LazySlice: ...  # Split responsibility - merged into LazySeries

Do not create split class hierarchies for different execution engines.
- ColumnExpr uniformly wraps ALL expression types (including Condition)
- LazySeries handles all deferred Series method executions
- No separate BoolColumnExpr - comparisons return ColumnExpr wrapping Condition
# ✅ Good: Unified approach
ds['age'] > 25 # Returns ColumnExpr(Condition)
(ds['age'] > 25).value_counts() # Works! ColumnExpr handles it
# ❌ Bad: Fragmented approach (rejected design)
ds['age'] > 25 # Returns BoolColumnExpr (separate class)
# Now need to duplicate methods in BoolColumnExpr...

Why unified?
- Less code duplication
- Consistent behavior
- Easier to maintain
- Single code path to debug
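
A minimal sketch of the unified idea follows. The class names mirror those in this document, but the bodies are toy re-implementations written for illustration; `value_counts_sql` in particular is a hypothetical helper. The comparison returns the same wrapper class, so any method defined once on the wrapper is available on boolean expressions too.

```python
class Field:
    def __init__(self, name):
        self.name = name

    def to_sql(self):
        return self.name


class Condition:
    def __init__(self, left, op, right):
        self.left, self.op, self.right = left, op, right

    def to_sql(self):
        return f"{self.left} {self.op} {self.right!r}"


class ColumnExpr:
    """Single wrapper for every expression kind, including conditions."""

    def __init__(self, expr):
        self.expr = expr

    def __gt__(self, other):
        # A comparison yields another ColumnExpr, not a separate bool class.
        return ColumnExpr(Condition(self.expr.to_sql(), ">", other))

    def value_counts_sql(self, table):  # hypothetical helper, defined once
        e = self.expr.to_sql()
        return f"SELECT {e} AS value, count(*) FROM {table} GROUP BY value"


age = ColumnExpr(Field("age"))
cond = age > 25
print(type(cond).__name__)              # ColumnExpr
print(cond.value_counts_sql("people"))
```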
Collect and analyze errors during execution. Test failures are valuable signals.
When a test fails:
- First: Analyze if it's a library issue that should be fixed
- Ask: Is this a feature we should implement but haven't?
- Never: Modify tests just to make them pass
# Test fails: ds['col'].str.contains('x', regex=True) doesn't work
# ❌ Bad response: Comment out the test or change expected behavior
# def test_contains():
# pass # TODO: fix later
# ✅ Good response: Analyze the root cause
# - Is regex=True not implemented? → Implement it
# - Is the SQL generation wrong? → Fix the SQL builder
# - Is this a chDB limitation? → Document it and provide a workaround

Expression (base)
├── Field # Column reference: ds['column']
├── Literal # Constant value: 42, 'hello'
├── Function # SQL function: upper(x), sum(x)
│ └── AggregateFunction # Aggregate: sum, avg, count
├── ArithmeticExpression # Math: a + b, a * b
├── Condition # Boolean: a > b, a == b
│ ├── BinaryCondition # Two operands: a > b
│ ├── CompoundCondition # Combined: (a > b) & (c < d)
│ └── UnaryCondition # Single: IS NULL
├── DateTimePropertyExpr # dt accessor: ds['date'].dt.year
└── DateTimeMethodExpr # dt method: ds['date'].dt.strftime('%Y')
Lazy Objects
├── ColumnExpr # Wraps Expression, provides pandas-like API
├── LazySeries # Deferred Series method execution
├── LazyAggregate # Deferred aggregate (mean, sum, etc.)
├── LazyCondition # Dual SQL/pandas condition
└── LazyGroupBy # Deferred groupby operations
Accessors (via .str, .dt, .arr, etc.)
├── StringAccessor # String functions
├── DateTimeAccessor # DateTime functions
├── ArrayAccessor # Array functions (ClickHouse-specific)
├── JsonAccessor # JSON functions (ClickHouse-specific)
├── UrlAccessor # URL functions (ClickHouse-specific)
├── IpAccessor # IP functions (ClickHouse-specific)
└── GeoAccessor # Geo functions (ClickHouse-specific)
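
To make the expression hierarchy more concrete, the sketch below renders a small tree to SQL. The class names are taken from the tree above, but the fields and the `to_sql` method are assumptions made for this example and may not match the real classes.

```python
from dataclasses import dataclass


class Expression:
    def to_sql(self) -> str:
        raise NotImplementedError


@dataclass
class Field(Expression):
    name: str

    def to_sql(self):
        return self.name


@dataclass
class Literal(Expression):
    value: object

    def to_sql(self):
        return repr(self.value)


class Condition(Expression):
    pass


@dataclass
class BinaryCondition(Condition):
    left: Expression
    op: str
    right: Expression

    def to_sql(self):
        return f"{self.left.to_sql()} {self.op} {self.right.to_sql()}"


@dataclass
class CompoundCondition(Condition):
    left: Condition
    op: str  # "AND" or "OR"
    right: Condition

    def to_sql(self):
        return f"({self.left.to_sql()}) {self.op} ({self.right.to_sql()})"


tree = CompoundCondition(
    BinaryCondition(Field("a"), ">", Field("b")),
    "AND",
    BinaryCondition(Field("c"), "<", Literal(10)),
)
print(tree.to_sql())  # (a > b) AND (c < 10)
```

Rendering is recursive: each node only knows how to print itself and delegates to its children.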
User Code Lazy Building Execution
─────────────────────────────────────────────────────────────────────
ds['age'] → ColumnExpr
.filter(ds.age > 25) → DataStore (SQL WHERE)
.str.upper() → ColumnExpr
.groupby('dept') → LazyGroupBy
.mean() → LazySeries
.values ──────────────────────────────→ Execute!
↓
Config check
↓
┌───────┴───────┐
│ │
chDB pandas
(SQL) (in-memory)
The config module controls execution behavior:
from datastore import config
# Set default execution engine
config.default_engine = ExecutionEngine.CHDB # or ExecutionEngine.PANDAS
# Configure per-function engine
config.function_config.use_pandas('strftime') # Use pandas for strftime
config.function_config.use_chdb('sum') # Use chDB for sum
# Profiling
config.profiling_enabled = True

DataStore supports arbitrary mixing of SQL and pandas operations:
SQL ops (lazy) → Pandas op (triggers) → SQL on DataFrame → Result
↓ ↓ ↓
Build query Execute SQL Use chDB Python()
Cache result table function
result = (ds
    .filter(ds.price > 100) # SQL (lazy)
    .add_prefix('sales_') # Pandas (executes SQL, caches)
    .filter(ds.sales_revenue > 1000) # SQL on cached DataFrame!
    .fillna(0) # Pandas on cached
    .to_df()) # Return result

- Register in function_definitions.py:
@register_function(
    name='my_method',
    clickhouse_name='myClickHouseFunc',
    func_type=FunctionType.SCALAR,
    category=FunctionCategory.STRING,
    doc='Description of what it does.',
)
def _build_my_method(expr, arg1, alias=None):
    from .functions import Function
    from .expressions import Literal
    return Function('myClickHouseFunc', expr, Literal(arg1), alias=alias)

- Methods are auto-injected into StringAccessor via the registry.
- If it needs execution (returns DataFrame/changes structure), implement in ColumnExprStringAccessor:
def my_method(self, ...):
    series = self._execute_series()
    result = series.str.my_method(...)
    return DataStore.from_df(result)

- Return LazySeries from the method:
def my_operation(self):
    return LazySeries(
        datastore=self._datastore,
        method_name='my_operation',
        method_args=(),
        method_kwargs={},
        source_expr=self._expr
    )

- Ensure LazySeries knows how to execute it (usually automatic via pandas delegation).
# ✅ Test lazy behavior
result = ds['col'].mean()
assert isinstance(result, LazyAggregate)
# ✅ Use natural triggers for execution
np.testing.assert_array_equal(result.values, expected_values)
# ✅ Test both execution engines when relevant
with use_pandas():
    pandas_result = ds['col'].mean().values
with use_chdb():
    chdb_result = ds['col'].mean().values
np.testing.assert_allclose(pandas_result, chdb_result)

# ❌ Don't call _execute() directly
result._execute()
# ❌ Don't modify tests to pass without understanding why they fail
# (commented out test)
# ❌ Don't assume execution engine
# Tests should work with both pandas and chDB when possible

When encountering an error:
- Categorize: Is this a user error, library bug, or missing feature?
- Analyze:
  - Check if similar operations work
  - Review the execution path
  - Identify where it diverges from expected behavior
- Document: If it's a limitation, document it clearly
- Implement: If it's a missing feature we should have, implement it
- Never: Just suppress or work around without understanding
datastore/
├── __init__.py # Public API exports
├── core.py # DataStore class, main entry point
├── column_expr.py # ColumnExpr and related classes
├── expressions.py # Expression base classes
├── conditions.py # Condition classes
├── functions.py # Function classes
├── function_definitions.py # Function registry definitions
├── function_registry.py # Function registration system
├── lazy_result.py # LazySeries, LazyCondition, etc.
├── lazy_ops.py # Lazy operation utilities
├── groupby.py # LazyGroupBy
├── config.py # Configuration system
├── executor.py # Execution engine
├── expression_evaluator.py # Expression evaluation
├── pandas_compat.py # PandasCompatMixin
├── pandas_api.py # Module-level pandas functions
├── connection.py # chDB connection
└── accessors/ # Accessor classes
├── string.py
├── datetime.py
├── array.py
├── json.py
├── url.py
├── ip.py
└── geo.py
| Principle | Description |
|---|---|
| Lazy First | Every DataFrame/Series-returning method returns a lazy object |
| Natural Triggers | Execution via .values, repr(), etc. - never explicit _execute() |
| Elegance > Compatibility | Clean architecture over backward compatibility |
| Unified Design | Single class hierarchy, no fragmentation by engine |
| Cherish Failures | Analyze test failures deeply, don't just fix tests |
The goal is simple: pandas API comfort with SQL performance, achieved through elegant lazy evaluation.
