- All methods returning DataFrame or Series should return Lazy objects (such as `DataStore`, `LazySeries`)
- Defer execution until results are actually needed
- Preserve the ability to select the optimal execution engine (pandas vs chDB/SQL) at execution stage
- API style does not determine execution engine: pandas-style, Pythonic, and SQL terminology should all compile to the same optimized backend execution
- Final execute stage selects pandas or chDB ExecutionEngine based on config system
- Prohibit explicit calls to the `_execute()` method
- Avoid explicit conversions like `to_df()`, `to_list()`, `to_pandas()` as much as possible
- Execution triggered through natural means:
  - `.columns` - get column names
  - `len()` - get length
  - `.index` - get index
  - `repr()` / `print()` - display results
  - `__iter__` - iteration
  - `.equals()` - comparison
- Do not consider backward compatibility; the first priority is architectural simplicity and elegance
- Don't create split class hierarchies for different execution engines
- `ColumnExpr` uniformly wraps all expression types
- Handle lazy execution through unified `LazySeries`, `LazyGroupBy`, etc.
- `LazyOp` system uniformly manages all lazy operations
- Avoid duplicate definitions, keep code structure clear with single responsibility
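The lazy-execution rules above can be illustrated with a minimal, pure-Python sketch. It is not DataStore's implementation: `LazyTable`, `CONFIG`, `EXECUTION_LOG`, and `_materialize` are hypothetical names invented for this example, and the "engine" is only recorded, not actually switched. What it shows is the shape of the idea: operations are recorded as `LazyOp` entries, nothing runs while the pipeline is built, and execution happens only through natural triggers such as iteration, `len()`, or `repr()`.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Row = Dict[str, object]

# Hypothetical stand-in for the config system that picks the engine.
CONFIG = {"engine": "pandas"}
# Records which engine was chosen each time a pipeline actually ran.
EXECUTION_LOG: List[str] = []


@dataclass(frozen=True)
class LazyOp:
    """One recorded, not-yet-executed operation (illustrative only)."""
    name: str
    apply: Callable[[List[Row]], List[Row]]


class LazyTable:
    """Toy lazy container: methods return new lazy objects, never results."""

    def __init__(self, rows: List[Row], ops: Tuple[LazyOp, ...] = ()):
        self._rows = list(rows)
        self._ops = ops

    def filter(self, predicate: Callable[[Row], bool]) -> "LazyTable":
        op = LazyOp("filter", lambda rs: [r for r in rs if predicate(r)])
        return LazyTable(self._rows, self._ops + (op,))

    def sort_values(self, key: str) -> "LazyTable":
        op = LazyOp("sort", lambda rs: sorted(rs, key=lambda r: r[key]))
        return LazyTable(self._rows, self._ops + (op,))

    def _materialize(self) -> List[Row]:
        # The engine is chosen here, at execution time, not when ops are added.
        EXECUTION_LOG.append(CONFIG["engine"])
        rows = self._rows
        for op in self._ops:
            rows = op.apply(rows)
        return rows

    # Natural triggers: callers never invoke _materialize() themselves.
    def __len__(self) -> int:
        return len(self._materialize())

    def __iter__(self):
        return iter(self._materialize())

    def __repr__(self) -> str:
        return repr(self._materialize())


people = LazyTable([
    {"name": "Bob", "age": 30},
    {"name": "Alice", "age": 25},
    {"name": "Eve", "age": 17},
])
# Building the pipeline records two ops but executes nothing.
adults = people.filter(lambda r: r["age"] > 20).sort_values("name")
# Iteration is the trigger that runs the recorded ops.
names = [row["name"] for row in adults]  # ["Alice", "Bob"]
```

Because the engine is read inside `_materialize()`, changing `CONFIG["engine"]` between building a pipeline and consuming it affects which engine is logged, which is the property the "select at execution stage" rule asks for.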
Philosophy:
- Discovered problems are opportunities to improve the library
- Analyze problems from an architectural perspective; do not casually modify tests just to make them pass
- Using `reset_index()` in tests to mask problems = DataStore bug, not correct test writing
- Don't obsess over container type differences between DataFrame and DataStore
FORBIDDEN behaviors:
- ❌ Using comments to describe expected behavior without actual assertions
- ❌ Using `print()` / `logging` without corresponding assertions
- ❌ Only verifying `len()` without verifying actual values
- ❌ Writing "verify X" comments but not actually verifying X
- ❌ Using `# TODO: verify later` or similar postponement
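To see why a length-only check is forbidden, consider a small self-contained sketch. Everything in it (`filter_above`, the two test functions, the `passes` helper) is hypothetical and unrelated to DataStore; the function under test is deliberately buggy so that the two styles of assertion can be compared.

```python
def filter_above(values, threshold):
    """Deliberately buggy: the comparison is inverted (< instead of >)."""
    return [v for v in values if v < threshold]


def length_only_test():
    result = filter_above([10, 15, 30, 40], 20)
    # Forbidden style: the length happens to match, so the bug goes unnoticed.
    assert len(result) == 2


def value_test():
    result = filter_above([10, 15, 30, 40], 20)
    # Required style: actual values are verified, so the bug is exposed.
    assert result == [30, 40]


def passes(test) -> bool:
    """Run a test function and report whether its assertions held."""
    try:
        test()
    except AssertionError:
        return False
    return True


length_only_result = passes(length_only_test)  # True: bug not detected
value_result = passes(value_test)              # False: bug detected
```

The length-only test passes against the broken function, while the value test fails as it should.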
REQUIRED:
- Mirror Code Pattern (DataStore ↔ Pandas)

  Test code must mirror DataStore and pandas operations for easy comparison:

      # ✅ GOOD - Mirror code pattern
      # pandas operations
      pd_df = pd.DataFrame({'name': ['Alice', 'Bob'], 'age': [25, 30]})
      pd_result = pd_df[pd_df['age'] > 20].sort_values('name')

      # DataStore operations (mirror of pandas)
      ds_df = DataStore({'name': ['Alice', 'Bob'], 'age': [25, 30]})
      ds_result = ds_df[ds_df['age'] > 20].sort_values('name')

      # Compare results
      assert_datastore_equals_pandas(ds_result, pd_result)
- Complete Output Comparison (Columns + Data + Order)

  Comparison must be complete: column names, data values, and row order (if the pandas operation preserves or defines order).

      # Import the utility function
      from tests.test_utils import assert_datastore_equals_pandas

      # Full comparison (columns + values + order)
      assert_datastore_equals_pandas(ds_result, pd_result)

      # For unordered results (e.g., groupby without sort)
      assert_datastore_equals_pandas(ds_result, pd_result, check_row_order=False)

      # Also available:
      # - assert_column_values_equal(ds_result, pd_result, 'col_name')
      # - assert_columns_match(ds_result, pd_result)
      # - assert_row_count_match(ds_result, pd_result)

      # For operations that preserve order (filter, map, etc.)
      pd_result = pd_df[pd_df['value'] > 10]  # preserves original order
      ds_result = ds_df[ds_df['value'] > 10]
      assert_datastore_equals_pandas(ds_result, pd_result)

      # For operations with explicit sorting
      pd_result = pd_df.sort_values('name', ascending=False)
      ds_result = ds_df.sort_values('name', ascending=False)
      assert_datastore_equals_pandas(ds_result, pd_result)

      # For operations with undefined order (groupby without sort)
      pd_result = pd_df.groupby('category').sum()
      ds_result = ds_df.groupby('category').sum()
      # Compare as sets or sort both before comparison
      assert_datastore_equals_pandas(ds_result, pd_result, check_row_order=False)
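The difference between an ordered and an unordered comparison can be sketched without pandas. The helper below is a hypothetical illustration, not the implementation of `assert_datastore_equals_pandas`: it assumes a table is a pair of a column list and a list of row tuples, always checks column names in order, and treats rows as a multiset when `check_row_order=False`.

```python
from collections import Counter
from typing import List, Tuple

# Assumed toy representation: (column names, rows as hashable tuples).
Table = Tuple[List[str], List[tuple]]


def assert_tables_equal(actual: Table, expected: Table,
                        check_row_order: bool = True) -> None:
    """Compare columns (always ordered) and rows (ordered or as a multiset)."""
    actual_cols, actual_rows = actual
    expected_cols, expected_rows = expected
    # Column order is part of the contract, so compare lists, not sets.
    assert actual_cols == expected_cols, (
        f"columns differ: {actual_cols} != {expected_cols}"
    )
    if check_row_order:
        assert actual_rows == expected_rows, (
            f"rows differ (ordered): {actual_rows} != {expected_rows}"
        )
    else:
        # Counter keeps duplicates, unlike set(), so repeated rows still count.
        assert Counter(actual_rows) == Counter(expected_rows), (
            f"rows differ (unordered): {actual_rows} != {expected_rows}"
        )


expected = (["category", "total"], [("a", 3), ("b", 7)])
shuffled = (["category", "total"], [("b", 7), ("a", 3)])

# Same rows in a different order: acceptable only when order is undefined.
assert_tables_equal(shuffled, expected, check_row_order=False)
```

A multiset is used rather than a plain set so that a result with a duplicated or missing duplicate row is still reported as a mismatch.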
- SQL Execution Verification via Logs

      # Capture and verify actual SQL executed
      log_output = log_capture.getvalue()
      self.assertIn('WHERE "value" > 30', log_output)  # Verify exact SQL clause
      self.assertIn('Segment 1/', log_output)  # Verify segment execution
      self.assertIn('[Pandas]', log_output)  # Verify Pandas ops logged
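The snippet above uses a `log_capture` object without showing where it comes from. One plausible setup, using only the standard library, is an `io.StringIO` buffer attached to a logger through a `logging.StreamHandler`. The logger name `"demo.sql"` and the logged message are placeholders assumed for this sketch; the logger that DataStore actually writes to is not specified here.

```python
import io
import logging


def attach_log_capture(logger_name: str):
    """Attach an in-memory handler to a logger and return all three pieces."""
    log_capture = io.StringIO()
    handler = logging.StreamHandler(log_capture)
    handler.setLevel(logging.DEBUG)
    logger = logging.getLogger(logger_name)
    logger.addHandler(handler)
    logger.setLevel(logging.DEBUG)
    return logger, handler, log_capture


# "demo.sql" is a hypothetical logger name used only for this example.
logger, handler, log_capture = attach_log_capture("demo.sql")
try:
    # Stand-in for running a pipeline that logs the SQL it executes.
    logger.debug('SELECT * FROM t WHERE "value" > 30')
    log_output = log_capture.getvalue()
finally:
    # Always detach the handler so later tests start from a clean logger.
    logger.removeHandler(handler)
```

In a `unittest.TestCase`, the attach step would typically live in `setUp` and the `removeHandler` call in `tearDown`.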
- Test Name Must Reflect What Is Being Verified

      # BAD
      def test_complex_pipeline(self):
          ...

      # GOOD
      def test_sql_pandas_sql_segments_exact_values_and_structure(self):
          ...
Self-Check Before Submitting Test:
- Does every comment about "expected behavior" have a corresponding assertion?
- Are actual data values verified, not just lengths?
- Are DataStore and pandas code mirrored (same operations, same style)?
- Is column order verified (not just column names as a set)?
- Is row order verified for order-preserving operations?
- Is segment structure (type, ops, is_first_segment) fully verified?
- Are error messages descriptive for debugging?
- Would this test catch a real bug, or just pass trivially?
Core Philosophy: Users write familiar pandas-style code; the backend automatically selects the optimal execution engine.