Conversation
Pull request overview
Fixes issue #1168, where `vertipaq_analyzer()` returns `"Model Summary"["Total Size"]` as a string, preventing numeric operations. The fix introduces a typed (non-formatted) return path.
Changes:
- Import and use `_update_dataframe_datatypes()` to cast returned DataFrame columns based on `vertipaq_map` data types.
- Return a new `final_dict` of DataFrames (int/float/bool/string-typed) instead of the display-formatted `dfs[...]`.
- Attempt to switch the `export == "table"` path to use `final_dict` (currently inconsistent/broken).
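The symptom and the fix can be sketched in isolation. The sample frames below are illustrative stand-ins for the analyzer's output, and `dtype_map` mirrors the mapping this PR adds:

```python
import pandas as pd

# Before the fix: display-formatted output stores "Total Size" as a string,
# so "numeric" operations silently concatenate instead of adding.
formatted = pd.DataFrame({"Total Size": ["1,024", "2,048"]})
bad_total = formatted["Total Size"].sum()  # string concatenation, not arithmetic

# After the fix: the typed return path casts columns according to the data
# types declared in vertipaq_map (collapsed here to a single column).
dtype_map = {"string": "string", "long": "int", "double": "float", "bool": "bool"}
col_types = {"Total Size": dtype_map["long"]}  # -> "int"
typed = pd.DataFrame({"Total Size": [1024, 2048]})
typed = typed.astype({c: "int64" for c, t in col_types.items() if t == "int"})
total = typed["Total Size"].sum()  # numeric sum, as callers expect
```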
Comments suppressed due to low confidence (1)
src/sempy_labs/semantic_model/_vertipaq_analyzer.py:1053
The `export == "table"` path is now broken: `final_dict` stores DataFrames directly (e.g., `final_dict[title] = df`), but `df_map` treats each entry as a dict and does `final_dict[k]["data"]`. Also, the keys used in `df_map` include "Model" while `final_dict` uses "Model Summary". This will raise at runtime (KeyError/TypeError) and prevent table export. Suggestion: either keep using `dfs = create_dfs(...)` for export, or change `final_dict` to use consistent keys and store the same `{"data": df, ...}` structure expected by `df_map`.
```python
if export == "table":
    #dfs = create_dfs(column_formatting="data_type")
    print(
        f"{icons.in_progress} Saving Vertipaq Analyzer to delta tables in the lakehouse...\n"
    )
    now = datetime.datetime.now()
    # Dataset metadata
    df_datasets = fabric.list_datasets(workspace=workspace_id, mode="rest")
    configured_by = df_datasets.loc[
        df_datasets["Dataset Id"] == dataset_id, "Configured By"
    ].iloc[0]
    (capacity_id, capacity_name) = resolve_workspace_capacity(
        workspace=workspace_id
    )
    base_metadata = {
        "Capacity Name": capacity_name,
        "Capacity Id": capacity_id,
        "Workspace Name": workspace_name,
        "Workspace Id": workspace_id,
        "Dataset Name": dataset_name,
        "Dataset Id": dataset_id,
        "Configured By": configured_by,
        "RunId": run_id,
        "Timestamp": now,
    }
    df_map = {
        k: final_dict[k]["data"]
        for k in [
            "Columns",
            "Tables",
            "Partitions",
            "Relationships",
            "Hierarchies",
            "Model",
        ]
    }
```
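One sketch of a repair, assuming `final_dict` keeps plain DataFrames keyed by section title as this PR builds it: alias the export names onto the actual `final_dict` keys instead of reading a `["data"]` entry that no longer exists. The `export_titles` name is illustrative, not from the module:

```python
import pandas as pd

# Stand-in for final_dict as built by this PR: plain DataFrames keyed by the
# return-section titles ("Model Summary", not "Model").
final_dict = {
    "Model Summary": pd.DataFrame({"Total Size": [3072]}),
    "Tables": pd.DataFrame({"Table Name": ["Sales"]}),
    "Partitions": pd.DataFrame(),
    "Columns": pd.DataFrame(),
    "Relationships": pd.DataFrame(),
    "Hierarchies": pd.DataFrame(),
}

# The export name differs from the final_dict key only for the model summary,
# so a small alias map keeps df_map consistent with what final_dict stores.
export_titles = {
    "Columns": "Columns",
    "Tables": "Tables",
    "Partitions": "Partitions",
    "Relationships": "Relationships",
    "Hierarchies": "Hierarchies",
    "Model": "Model Summary",
}
df_map = {export_name: final_dict[title] for export_name, title in export_titles.items()}
```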
```python
    #dfs = create_dfs(column_formatting="data_type")
```

There is commented-out code left in the export branch (`#dfs = create_dfs(column_formatting="data_type")`). Since the function's return/export logic is being reworked, this should either be removed or reinstated with the correct implementation, to avoid confusion and keep the export path maintainable.
```python
# Prepare output for returned dictionary of dataframes and for exported dataframes
dtype_map = {"string": "string", "long": "int", "double": "float", "bool": "bool"}
return_sections = {
    "Model": "Model Summary",
    "Tables": "Tables",
    "Partitions": "Partitions",
    "Columns": "Columns",
    "Relationships": "Relationships",
    "Hierarchies": "Hierarchies",
}
final_dict = {}
for name, title in return_sections.items():
    items = config[name]
    data = items.get("data")
    sort_col = items.get("sortby")
    df = pd.DataFrame(data, columns=list(vertipaq_map[name].keys()))
    if sort_col and sort_col in df.columns:
        df = df.sort_values(by=sort_col, ascending=False).reset_index(drop=True)
    col_types = {
```
In the `export is None` path, the code now builds DataFrames twice: once in the new `final_dict` loop and again in `create_dfs()` for visualization. For large models (especially the Columns section) this doubles DataFrame construction/sorting work. Consider deriving the returned DataFrames from the already-built `dfs` (or vice versa) by copying before formatting, so the raw-typed and display-formatted outputs share the same underlying DataFrame build/cleanup steps.
```python
# Prepare output for returned dictionary of dataframes and for exported dataframes
dtype_map = {"string": "string", "long": "int", "double": "float", "bool": "bool"}
return_sections = {
    "Model": "Model Summary",
    "Tables": "Tables",
    "Partitions": "Partitions",
    "Columns": "Columns",
    "Relationships": "Relationships",
    "Hierarchies": "Hierarchies",
}
final_dict = {}
for name, title in return_sections.items():
    items = config[name]
    data = items.get("data")
    sort_col = items.get("sortby")
    df = pd.DataFrame(data, columns=list(vertipaq_map[name].keys()))
    if sort_col and sort_col in df.columns:
        df = df.sort_values(by=sort_col, ascending=False).reset_index(drop=True)
    col_types = {
        k: dtype_map.get(v["data_type"], "string")
        for k, v in vertipaq_map[name].items()
        if k in df.columns
    }
    _update_dataframe_datatypes(df, col_types)
    final_dict[title] = df
```
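The single-build approach the comment suggests could look roughly like this; `build_section` is an illustrative stand-in, not a helper from the module. Build and sort each section once, then copy before display formatting so both outputs share the construction work:

```python
import pandas as pd

def build_section(data, columns, sort_col=None):
    """Construct and sort one section DataFrame exactly once."""
    df = pd.DataFrame(data, columns=columns)
    if sort_col and sort_col in df.columns:
        df = df.sort_values(by=sort_col, ascending=False).reset_index(drop=True)
    return df

raw = build_section(
    [("Date", 512), ("Sales", 2048)],
    ["Table Name", "Total Size"],
    sort_col="Total Size",
)

typed = raw.copy()    # returned to the caller; numeric dtypes stay intact
display = raw.copy()  # formatted only for visualization
display["Total Size"] = display["Total Size"].map("{:,}".format)
```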
The new `final_dict` construction duplicates the DataFrame-building logic from `create_dfs()` but does not apply the same schema/row cleanup (e.g., filtering out `Type == "RowNumber"`, dropping "Source Column", removing Direct Lake-only partition columns when not Direct Lake, and removing "Missing Rows" when `read_stats_from_data` is false). This changes the returned DataFrames compared to what is visualized/previously returned and can reintroduce unwanted rows/columns. Suggestion: reuse the existing cleanup logic (refactor it into a shared helper or run the same filtering/dropping steps before storing each df in `final_dict`).
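A possible shape for the shared cleanup helper; the filters restate the ones the comment lists, while parameter names such as `is_direct_lake`, `read_stats_from_data`, and `direct_lake_only_cols` are assumptions about the surrounding function, not its actual signature:

```python
import pandas as pd

def clean_section(df, name, is_direct_lake=False, read_stats_from_data=False,
                  direct_lake_only_cols=()):
    """Apply identical row/column cleanup to raw-typed and display DataFrames."""
    df = df.copy()
    if name == "Columns":
        if "Type" in df.columns:
            df = df[df["Type"] != "RowNumber"]  # drop internal RowNumber columns
        df = df.drop(columns=["Source Column"], errors="ignore")
    if name == "Partitions" and not is_direct_lake:
        # direct_lake_only_cols is a placeholder for the Direct Lake-only columns
        df = df.drop(columns=list(direct_lake_only_cols), errors="ignore")
    if not read_stats_from_data:
        df = df.drop(columns=["Missing Rows"], errors="ignore")
    return df.reset_index(drop=True)
```

Running both the returned and the visualized frames through this one function keeps the two outputs structurally identical by construction.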
#1168