fix(openai): guard Path(None) when chunk metadata lacks source key (#370)#401
Conversation
doc_metadata.get('source') returns None for older partitions, web-search
results, or chunks from loaders that don't set the field. Path(None)
raises TypeError, which surfaces as a 500 and takes down the entire
chat completion response. Default the missing key to an empty string
before constructing the Path.
Fixes #370
📝 WalkthroughWalkthroughDelegate document link construction to a new stdlib-only helper ChangesSource Metadata Defensive Handling
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@openrag/routers/openai.py`:
- Around line 107-109: The code currently always computes file_url from source
even when source/filename is empty; change the logic around doc_metadata
handling (look for variables source, filename, file_url) so that you only call
request.url_for("static", path=filename) and set file_url when filename is
truthy (non-empty); if filename is falsy, do not include file_url in the
returned metadata/dict (omit the key or leave it undefined) to avoid emitting a
static URL with an empty path.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: df766678-5788-4763-b04c-7cb5dd7b6e54
📒 Files selected for processing (1)
openrag/routers/openai.py
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@openrag/routers/test_openai_prepare_sources.py`:
- Around line 18-31: The tests currently duplicate logic in
_prepare_sources_filename and _emits_file_url which masks regressions; update
the tests to exercise the real router logic by calling the production function
__prepare_sources in openrag/routers/openai.py (or import and use the shared
helper exported by that module) instead of the local mirror helpers, and replace
assertions that use _prepare_sources_filename/_emits_file_url with assertions on
the actual __prepare_sources return values (and its emitted file_url behavior)
so the test verifies real behavior rather than duplicated logic.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 48bb4704-9049-47b7-a3a1-f5d6009db6ee
📒 Files selected for processing (2)
openrag/routers/openai.pyopenrag/routers/test_openai_prepare_sources.py
…helper Move the per-document link assembly out of __prepare_sources into a stdlib-only routers/source_links.build_document_source_link. The regression test now exercises the real helper instead of a mirrored copy, so it catches production drift. The router passes request-bound url builders to keep request.url_for usage in the HTTP layer.
There was a problem hiding this comment.
🧹 Nitpick comments (3)
openrag/routers/source_links.py (2)
34-39: 💤 Low value
**doc_metadatacan silently override computed keys.Because
**doc_metadatais spread aftersource_type,file_url, andchunk_url, any chunk metadata that happens to carry one of those keys (e.g., a stalechunk_url, asource_typeinjected upstream, or a precomputedfile_url) will overwrite the values just computed here. Today this is mostly theoretical, but it makes the helper’s contract fragile — a future loader adding one of those reserved keys tometadatawould silently break the response shape without any test failing.Consider spreading
doc_metadatafirst so the computed keys win, or popping the reserved keys before merging.♻️ Proposed refactor
return { + **doc_metadata, "source_type": "document", **({"file_url": encoded_url} if encoded_url else {}), "chunk_url": chunk_url_builder(doc_metadata["_id"]), - **doc_metadata, }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@openrag/routers/source_links.py` around lines 34 - 39, The returned dict currently spreads doc_metadata last so any reserved keys in doc_metadata (e.g., "source_type", "file_url", "chunk_url") can silently override computed values; update the logic in the function that builds the return (references: doc_metadata, encoded_url, chunk_url_builder) to either spread doc_metadata first and then explicitly set "source_type", "file_url" (when encoded_url), and "chunk_url" so computed keys win, or explicitly remove/pop those reserved keys from doc_metadata before merging; ensure the final object always uses the computed chunk_url_builder(doc_metadata["_id"]) and computed encoded_url-derived file_url.
37-37: 💤 Low value
doc_metadata["_id"]will raiseKeyErroron malformed input.The docstring documents
_idas required, which is fine for the production callers, but the helper is now stdlib-only and unit-testable, so a misformed dict will surface as a bareKeyErrorrather than something the caller can distinguish from a real bug. Since the same defensive treatment that motivated this PR (source→"") was applied here, a symmetric guard (e.g. raise a clearValueError, or accept missing_idand omitchunk_url) would make the contract explicit and easier to reason about.Not a blocker — only worth doing if you want the helper to be strict about its contract.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@openrag/routers/source_links.py` at line 37, The current code calls chunk_url_builder(doc_metadata["_id"]) which will raise a bare KeyError for malformed input; update the helper to explicitly check for the "_id" key before using it and either (a) raise a clear ValueError like "missing required '_id' in doc_metadata" when "_id" is absent, or (b) skip/omit the "chunk_url" entry if you prefer permissive behavior — implement the chosen behavior in the code that builds the dict (the place that currently has "chunk_url": chunk_url_builder(doc_metadata["_id"])) so callers get an explicit contract instead of a KeyError.openrag/routers/test_openai_prepare_sources.py (1)
47-55: 💤 Low valueEncoding coverage is thin — consider one more character class.
test_file_url_emitted_and_encoded_when_source_presentonly verifies space encoding, which is the easy case. A filename containing characters thatquote(safe=":/")won't encode (e.g.:or/appearing in the basename via something exotic) vs. characters that it will (e.g.#,?,%, non-ASCII likeé) would lock in thesafe=":/"choice against accidental changes. Worth one extra parametrized case so the encoding contract is actually pinned down.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@openrag/routers/test_openai_prepare_sources.py` around lines 47 - 55, Add a new parametrized case alongside test_file_url_emitted_and_encoded_when_source_present that uses _build to create links with basenames containing characters that must be percent-encoded (e.g. "#", "?", "%", non-ASCII like "é") and assert the produced link["file_url"] encodes those characters (so the test pins the behavior of quote(safe=":/")). Alternatively, add a new test function (e.g. test_file_url_encodes_reserved_and_non_ascii_chars) that calls _build with a source like "/srv/data/fi#le?.é.pdf" and asserts the expected percent-encoded basename appears in "https://host/static/…".
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@openrag/routers/source_links.py`:
- Around line 34-39: The returned dict currently spreads doc_metadata last so
any reserved keys in doc_metadata (e.g., "source_type", "file_url", "chunk_url")
can silently override computed values; update the logic in the function that
builds the return (references: doc_metadata, encoded_url, chunk_url_builder) to
either spread doc_metadata first and then explicitly set "source_type",
"file_url" (when encoded_url), and "chunk_url" so computed keys win, or
explicitly remove/pop those reserved keys from doc_metadata before merging;
ensure the final object always uses the computed
chunk_url_builder(doc_metadata["_id"]) and computed encoded_url-derived
file_url.
- Line 37: The current code calls chunk_url_builder(doc_metadata["_id"]) which
will raise a bare KeyError for malformed input; update the helper to explicitly
check for the "_id" key before using it and either (a) raise a clear ValueError
like "missing required '_id' in doc_metadata" when "_id" is absent, or (b)
skip/omit the "chunk_url" entry if you prefer permissive behavior — implement
the chosen behavior in the code that builds the dict (the place that currently
has "chunk_url": chunk_url_builder(doc_metadata["_id"])) so callers get an
explicit contract instead of a KeyError.
In `@openrag/routers/test_openai_prepare_sources.py`:
- Around line 47-55: Add a new parametrized case alongside
test_file_url_emitted_and_encoded_when_source_present that uses _build to create
links with basenames containing characters that must be percent-encoded (e.g.
"#", "?", "%", non-ASCII like "é") and assert the produced link["file_url"]
encodes those characters (so the test pins the behavior of quote(safe=":/")).
Alternatively, add a new test function (e.g.
test_file_url_encodes_reserved_and_non_ascii_chars) that calls _build with a
source like "/srv/data/fi#le?.é.pdf" and asserts the expected percent-encoded
basename appears in "https://host/static/…".
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 8c3571e0-e2ae-4e88-a2b0-d266f12f02a4
📒 Files selected for processing (3)
openrag/routers/openai.pyopenrag/routers/source_links.pyopenrag/routers/test_openai_prepare_sources.py
Summary
routers/openai.py:107(__prepare_sources) builds the displayed filename viaPath(doc_metadata.get(\"source\")).name. When the metadata key is absent,dict.getreturnsNoneandPath(None)raisesTypeError, crashing the response handler with a 500 instead of either omitting the link or showing a placeholder.This affects retrieval results that lack
source(older partitions, web-search results inserted by_prepare_for_web_only, or chunks from loaders that forget to set the field).This PR defaults the missing key to an empty string before constructing the
Path.Test plan
sourcekey and confirm the response is rendered (with empty filename) rather than 500Fixes #370
Summary by CodeRabbit