
Fix multiple issues with embedding storage and export#15

Merged
caufieldjh merged 8 commits into main from 14-export-of-embeddings-fails-possibly-due-to-missing-field_names-in-store-object
Mar 6, 2026

Conversation

@caufieldjh

Contributor

This pull request improves the robustness and flexibility of the embedding generation and export utilities, especially when working with CurateGPT and DuckDB backends. It addresses inconsistencies in backend result formats, enhances error handling, and expands test coverage to ensure reliable CSV export and embedding workflows. Additionally, it introduces integration test support and refines the CI workflow to better handle optional, external-service-dependent tests.

Enhancements to embedding export and generation:

  • Added the _normalize_store_result helper to handle different result shapes returned by CurateGPT backends (dicts and tuples), ensuring consistent document extraction for CSV export.
  • Improved export_embeddings_to_csv to infer field names from the document data when the backend returns empty field names, which prevents export failures and ensures all relevant fields are included. The function now also uses the "duckdb" backend explicitly.
  • Enhanced error handling in both embedding generation and export functions by checking for missing files or environment variables before proceeding.
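
The normalization and field-name fallback described above can be sketched roughly as follows. This is an illustration, not the merged code: the function bodies, the assumed result shapes (a plain dict, or a tuple whose first element is the document dict), and the helper name `infer_field_names` are all assumptions made for the example.

```python
from typing import Any, Dict, Iterable, List, Optional


def normalize_store_result(result: Any) -> Optional[Dict[str, Any]]:
    """Return the document dict from a backend result, or None if absent.

    Assumed shapes (hypothetical): a plain dict, or a tuple whose first
    element is the document dict, e.g. (document, distance, metadata).
    """
    if isinstance(result, dict):
        return result
    if isinstance(result, tuple) and result and isinstance(result[0], dict):
        return result[0]
    return None


def infer_field_names(documents: Iterable[Dict[str, Any]]) -> List[str]:
    """Collect keys across all documents, preserving first-seen order."""
    seen: Dict[str, None] = {}
    for doc in documents:
        for key in doc:
            seen.setdefault(key, None)
    return list(seen)


# Mixed result shapes, including one that cannot be normalized.
raw_results = [
    {"id": "A:1", "label": "alpha"},
    ({"id": "B:2", "label": "beta", "definition": "second"}, 0.12, None),
    "unexpected",
]
docs = [d for d in map(normalize_store_result, raw_results) if d is not None]

# Simulate a backend that reports no field names, then fall back to inference.
backend_field_names: List[str] = []
field_names = backend_field_names or infer_field_names(docs)
```

Collecting keys across every document, rather than only the first, means a field that appears only in later records still gets a CSV column.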

Testing improvements:

  • Added tests covering missing field names and tuple-shaped CurateGPT results (as with DuckDB), and verifying the fallback and normalization logic.
  • Introduced an optional end-to-end integration test for the entire embedding generation and export workflow, gated by environment variables to avoid running in default CI.
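
One way to gate such a test on environment variables is sketched below. The variable names are hypothetical, since the description does not say which ones the test checks. The standard-library `unittest` skip decorator is used here, rather than a pytest marker, so the example runs without extra dependencies; pytest also collects and honors tests written this way.

```python
import os
import unittest
from typing import Mapping

# Hypothetical variable names; the PR does not state which ones it checks.
REQUIRED_ENV_VARS = ("RUN_INTEGRATION_TESTS", "OPENAI_API_KEY")


def integration_enabled(env: Mapping[str, str] = os.environ) -> bool:
    """True only when every required variable is set to a non-empty value."""
    return all(env.get(name) for name in REQUIRED_ENV_VARS)


class EmbeddingWorkflowIntegrationTest(unittest.TestCase):
    @unittest.skipUnless(integration_enabled(), "integration env vars not set")
    def test_generate_then_export(self) -> None:
        # Placeholder body: a real test would generate embeddings and
        # export them to CSV here.
        self.assertTrue(integration_enabled())
```

Because the test is skipped rather than failed when the variables are absent, a default CI run stays green without credentials.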

Continuous Integration and configuration:

  • Updated the CI workflow to skip integration tests by default, preventing failures due to missing external dependencies or credentials.
  • Added pytest configuration to mark integration tests and filter out specific deprecation warnings, keeping CI output clean and allowing opt-in integration test runs.
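
As a rough illustration of the configuration side, the marker and warning filter could be registered from a `conftest.py` hook, as below. The PR may instead use an ini-style config file. The marker description and the module name `some_dependency` are placeholders, not values taken from the repository.

```python
def pytest_configure(config) -> None:
    """Register the integration marker and a deprecation-warning filter."""
    # Registering the marker avoids "unknown marker" warnings for
    # tests decorated with @pytest.mark.integration.
    config.addinivalue_line(
        "markers",
        "integration: requires external services or credentials",
    )
    # Placeholder module name; substitute the dependency whose
    # deprecation warnings clutter the test output.
    config.addinivalue_line(
        "filterwarnings",
        "ignore::DeprecationWarning:some_dependency",
    )
```

With the marker registered, a CI job can skip these tests with `pytest -m "not integration"`, and an opt-in run can select them with `pytest -m integration`.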

Minor code and logging improvements:

  • Improved log formatting and consistency for better readability during embedding generation and export.

These changes make the embedding utilities more reliable and adaptable to backend changes, and ensure that both local and CI testing are robust and maintainable.

@caufieldjh caufieldjh merged commit 2d6ba38 into main Mar 6, 2026
2 checks passed
@caufieldjh caufieldjh deleted the 14-export-of-embeddings-fails-possibly-due-to-missing-field_names-in-store-object branch March 6, 2026 16:34

Development

Successfully merging this pull request may close these issues.

export of embeddings fails possibly due to missing "field_names" in store object

1 participant