feat: Addressing bug in load_cdc_records, adding test suite #126
Conversation
```python
    how="diagonal",
).cast(orig_cast)

if existing_rows.height > 0:
```
Code style opinion: some of these conditionals are best left out, for the sake of readability and consistency; e.g., here there isn't a special case where existing_rows is non-empty which needs particular handling, rather there's an operation that will always be done to the contents of existing_rows, and it's a no-op if existing_rows is empty. Letting the same code run on empty and non-empty data is simpler to think about and gives fewer opportunities for edge-case-specific bugs to come up.
I think similar arguments can be made for the conditionals on 397, 426, 437, 460 and 466, assuming the respective code behaves as expected on empty data. (429 gets a pass because it looks like that does an operation on cdc_df that would be expensive even if missing_keys is empty.)
In this case, also, if you take out this conditional (and the one on 397) you can also take out the separate definition of new_update_rows on 396.
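The argument can be seen with a plain-Python stand-in (polars DataFrames behave the same way; `apply_updates` is illustrative, not code from this PR):

```python
# Illustrative only: the same transformation is a no-op on empty input,
# so guarding it with an emptiness check buys nothing.
def apply_updates(existing_rows: list[dict], updates: dict) -> list[dict]:
    # Merge any pending update into each row; empty input yields empty output.
    return [row | updates.get(row["key"], {}) for row in existing_rows]
```

Running this unconditionally on an empty `existing_rows` returns an empty list, exactly what the guarded version would produce.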
src/odin/generate/cubic/ods_fact.py
Outdated
```python
cdc_seq_max=str(cdc_df.get_column("header__change_seq").max()),

all_cdc_frames: list[pl.DataFrame] = []
current_min_seq = max_fact_seq
max_load_records = 10_000
```
Code nit: Constants should be moved to the top of the file and given an all-caps name.
Made a MAX_LOAD_RECORDS constant, which the local max_load_records is loaded from (the local itself is mutated at runtime, so it is not a constant).
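As a sketch of the resulting shape (function body illustrative only):

```python
# Module-level constant, all-caps, defined at the top of the file.
MAX_LOAD_RECORDS = 10_000


def load_batch() -> int:
    # Local working copy; it may be adjusted at runtime, hence lowercase.
    max_load_records = MAX_LOAD_RECORDS
    return max_load_records
```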
```python
all_cdc_frames: list[pl.DataFrame] = []
current_min_seq = max_fact_seq
max_load_records = 10_000
for _ in range(11):
```
Similarly; do we know why this is 11?
This is pretty arbitrary, but seemed to be working ok before
src/odin/generate/cubic/ods_fact.py
Outdated
```python
_, max_odin_index = ds_metadata_min_max(fact_ds, "odin_index")

# Log initial fact table state
init_log = ProcessLog(
```
It's a pre-existing issue, but while we're fixing this function we can also fix how ProcessLog is being mis-used here. Should have one ProcessLog for the function, with further .add_metadata(), .complete() and .fail() calls as appropriate.
Good call, I consolidated the existing log functions into one
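For illustration, a minimal sketch of the one-log-per-function pattern the review suggests; `ProcessLog` below is a stand-in stub with assumed method names, since the real class is internal to the odin codebase:

```python
# Stand-in for odin's internal ProcessLog; only the shape of the API matters.
class ProcessLog:
    def __init__(self, name: str):
        self.name = name
        self.metadata: dict = {}
        self.status = "started"

    def add_metadata(self, **kwargs):
        self.metadata.update(kwargs)

    def complete(self):
        self.status = "complete"

    def fail(self, exc: Exception):
        self.status = f"failed: {exc}"


def load_cdc_records_skeleton() -> ProcessLog:
    log = ProcessLog("load_cdc_records")  # one log for the whole function
    try:
        log.add_metadata(max_fact_seq="0")  # record state as it is discovered
        # ... do the actual CDC load here ...
        log.complete()
    except Exception as exc:
        log.fail(exc)
        raise
    return log
```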
src/odin/generate/cubic/ods_fact.py
Outdated
```python
# Load fact dataset and get current max sequence
s3_objects = list_objects(f"s3://{self.s3_export}/", in_filter=".parquet")
# --- Step 1: Load current fact table state ---
fact_ds = ds_from_path(f"s3://{self.s3_export}/")
```
Nit: should use the S3 directory path utility function from earlier.
Fixed this to use your S3 directory path utility function
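For context, the kind of change this refers to might look like the following; `s3_dir_path` is an assumed name and shape for the utility, which is not shown in this thread:

```python
# Hypothetical shape of the S3 directory path helper referenced above.
def s3_dir_path(bucket_or_prefix: str) -> str:
    """Normalize a bucket/prefix into an s3:// directory URI."""
    return f"s3://{bucket_or_prefix.strip('/')}/"

# Instead of building the path inline:
#     fact_ds = ds_from_path(f"s3://{self.s3_export}/")
# the dataset path comes from the shared helper:
#     fact_ds = ds_from_path(s3_dir_path(self.s3_export))
```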
…onstants, simplifying logic around empty updates
Background
`ods_fact.py` has historically maintained 3 separate accumulators (`insert_df`, `update_df`, `delete_df`) for CDC operations that are incrementally built via `cdc_to_fact()`. This interleaved approach made it difficult to reason about operation conflicts. Combined with a complete lack of unit tests, this led to a bug where keys that were updated and subsequently deleted within a single batch were reinserted after the delete, which we saw impact the data in the unsettled transactions fact tables.

Solution
This PR does 2 things to address this and prevent similar errors from being reintroduced in the future:

1. Restructures `load_cdc_records()` by accumulating all CDC records as raw data first, then resolving each key to its final operation (I/U/D) in one pass
2. Adds a test suite covering `load_cdc_records()`

Here is the order of operations for the new structure (per the new docstring):
Important to note: since updates are sparse (not every column need be updated at once), when the final operation is U, all updates for the key are applied in sequence. If a key is inserted and updated within a single batch, the insert acts as the base for the subsequent updates. When the final operation for a key is I or D, only that last operation needs to be applied.
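As an illustrative plain-Python sketch of that resolution step (the real implementation operates on polars DataFrames; the event tuple shape and field names here are made up):

```python
# Resolve one key's CDC events to its final operation, folding sparse
# updates in sequence on top of the insert that acts as the base row.
def resolve_key(events: list[tuple]) -> tuple[str, dict]:
    """events: (seq, op, payload) tuples for one key; payload is sparse."""
    events = sorted(events, key=lambda e: e[0])  # order by change sequence
    final_op = events[-1][1]
    if final_op in ("I", "D"):
        # Only the last operation matters for inserts and deletes.
        return final_op, events[-1][2]
    # Final op is U: fold every payload in sequence; the insert (if any)
    # supplies the base, and each sparse update overwrites only its columns.
    row: dict = {}
    for _, _, payload in events:
        row.update({k: v for k, v in payload.items() if v is not None})
    return "U", row
```

For example, an insert followed by two sparse updates yields a single U carrying the merged row, while an update followed by a delete collapses to just the D, which is exactly the case the old accumulator approach got wrong.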
Post-deploy operations