Fix multiple issues with embedding storage and export by caufieldjh · Pull Request #15 · bioepic-data/trowel

caufieldjh · 2026-03-05T23:36:51Z

This pull request improves the robustness and flexibility of the embedding generation and export utilities, especially when working with CurateGPT and DuckDB backends. It addresses inconsistencies in backend result formats, enhances error handling, and expands test coverage to ensure reliable CSV export and embedding workflows. Additionally, it introduces integration test support and refines the CI workflow to better handle optional, external-service-dependent tests.

Enhancements to embedding export and generation:

Added the _normalize_store_result helper to handle different result shapes returned by CurateGPT backends (dicts and tuples), ensuring consistent document extraction for CSV export.
Improved export_embeddings_to_csv to infer field names from document data if the backend returns empty field names, preventing export failures and ensuring all relevant fields are included. The function also now uses the "duckdb" backend explicitly.
Enhanced error handling in both embedding generation and export functions by checking for missing files or environment variables before proceeding.
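The three changes above can be illustrated with a minimal sketch. This is not the code from the PR: the function names below are hypothetical stand-ins, the tuple layout (document first, then score or metadata) is an assumption about what the backend returns, and `OPENAI_API_KEY` is only an assumed example of a required environment variable.

```python
import os
from pathlib import Path
from typing import Any, Dict, Iterable, List, Mapping, Optional, Sequence


def normalize_store_result(result: Any) -> Optional[Dict[str, Any]]:
    """Return the document dict from a dict- or tuple-shaped result.

    Assumption: tuple-shaped results carry the document as their first item.
    """
    if isinstance(result, dict):
        return result
    if isinstance(result, (tuple, list)) and result and isinstance(result[0], dict):
        return result[0]
    return None  # unknown shape; the caller decides whether to skip or fail


def infer_field_names(documents: Iterable[Dict[str, Any]]) -> List[str]:
    """Collect keys across all documents, keeping first-seen order."""
    seen: Dict[str, None] = {}
    for doc in documents:
        for key in doc:
            seen.setdefault(key, None)
    return list(seen)


def check_prerequisites(
    input_path: str,
    required_env: Sequence[str] = ("OPENAI_API_KEY",),  # assumed variable name
    env: Optional[Mapping[str, str]] = None,
) -> None:
    """Fail early if the input file or a required variable is missing."""
    env = os.environ if env is None else env
    if not Path(input_path).is_file():
        raise FileNotFoundError(f"Input file not found: {input_path}")
    missing = [name for name in required_env if not env.get(name)]
    if missing:
        raise EnvironmentError("Missing environment variables: " + ", ".join(missing))
```

With helpers along these lines, an exporter could normalize every backend result first, drop the ones that come back as `None`, and fall back to `infer_field_names` only when the store reports no field names of its own.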

Testing improvements:

Added tests covering missing field names and tuple-shaped CurateGPT results (as with DuckDB), and verifying the fallback and normalization logic.
Introduced an optional end-to-end integration test for the entire embedding generation and export workflow, gated by environment variables to avoid running in default CI.
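One way to gate an integration test on environment variables is sketched below. The variable names (`TROWEL_RUN_INTEGRATION`, `OPENAI_API_KEY`) and the helper are hypothetical; the PR does not state which variables it actually checks.

```python
import os
from typing import Mapping, Optional

OPT_IN_FLAG = "TROWEL_RUN_INTEGRATION"  # hypothetical opt-in switch
REQUIRED_ENV = ("OPENAI_API_KEY",)      # assumed credential for the embedding service


def integration_skip_reason(env: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Return None when the integration test may run, else a skip reason."""
    env = os.environ if env is None else env
    if env.get(OPT_IN_FLAG) != "1":
        return f"set {OPT_IN_FLAG}=1 to run integration tests"
    missing = [name for name in REQUIRED_ENV if not env.get(name)]
    if missing:
        return "missing environment variables: " + ", ".join(missing)
    return None
```

In a pytest module, the result could feed `pytest.mark.skipif(integration_skip_reason() is not None, reason=...)`, so the test is reported as skipped rather than failed when the opt-in flag or credentials are absent.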

Continuous Integration and configuration:

Updated the CI workflow to skip integration tests by default, preventing failures due to missing external dependencies or credentials.
Added pytest configuration to mark integration tests and filter out specific deprecation warnings, keeping CI output clean and allowing opt-in integration test runs.
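The marker registration and warning filter described above could also be expressed in a `conftest.py`, as in the sketch below. The PR may well keep this in a pytest configuration file instead; the marker description and the `some_dependency` module pattern are placeholders, not values taken from the repository.

```python
# conftest.py (sketch): register an "integration" marker and silence one
# family of deprecation warnings. The module pattern is a placeholder.

def pytest_configure(config):
    """Hook pytest calls at startup; adds lines to its ini-style options."""
    config.addinivalue_line(
        "markers",
        "integration: tests that need external services or credentials",
    )
    config.addinivalue_line(
        "filterwarnings",
        "ignore::DeprecationWarning:some_dependency.*",
    )
```

With the marker registered, a CI job can deselect those tests with `pytest -m "not integration"`, while a developer with credentials opts in with `pytest -m integration`.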

Minor code and logging improvements:

Improved log formatting and consistency for better readability during embedding generation and export.

These changes make the embedding utilities more reliable and adaptable to backend changes, and ensure that both local and CI testing are robust and maintainable.

caufieldjh added 7 commits March 5, 2026 18:11

Update poetry lock

26384b8

Add fix for embedding generation using wrong store; add tests

bb7fa42

Fix import order for embedding generation utils

01d045b

Type fix

989e623

Add embedding integration tests

f4a754b

Add a fix for varying store shapes received from curategpt

03e40d2

Expand embedding integ test to include read

61020a5

caufieldjh linked an issue Mar 5, 2026 that may be closed by this pull request

export of embeddings fails possibly due to missing "field_names" in store object #14

Closed

caufieldjh mentioned this pull request Mar 5, 2026

export of embeddings fails possibly due to missing "field_names" in store object #14

Closed

Fixes for curategpt import

0e248a5

caufieldjh merged commit 2d6ba38 into main Mar 6, 2026
2 checks passed

caufieldjh deleted the 14-export-of-embeddings-fails-possibly-due-to-missing-field_names-in-store-object branch March 6, 2026 16:34

caufieldjh merged 8 commits into main from 14-export-of-embeddings-fails-possibly-due-to-missing-field_names-in-store-object