Skip to content

fix(openai): guard Path(None) when chunk metadata lacks source key (#370)#401

Merged
EnjoyBacon7 merged 4 commits into
refactor/hexagonalfrom
fix/370-source-path-none-guard
May 21, 2026
Merged

fix(openai): guard Path(None) when chunk metadata lacks source key (#370)#401
EnjoyBacon7 merged 4 commits into
refactor/hexagonalfrom
fix/370-source-path-none-guard

Conversation

@EnjoyBacon7

@EnjoyBacon7 EnjoyBacon7 commented May 20, 2026

Copy link
Copy Markdown
Collaborator

Summary

routers/openai.py:107 (__prepare_sources) builds the displayed filename via Path(doc_metadata.get(\"source\")).name. When the metadata key is absent, dict.get returns None and Path(None) raises TypeError, crashing the response handler with a 500 instead of either omitting the link or showing a placeholder.

This affects retrieval results that lack source (older partitions, web-search results inserted by _prepare_for_web_only, or chunks from loaders that forget to set the field).

This PR defaults the missing key to an empty string before constructing the Path.

Test plan

  • Issue a chat completion with a chunk whose metadata has no source key and confirm the response is rendered (with empty filename) rather than 500
  • Verify existing chat completions still produce correct filenames

Fixes #370

Summary by CodeRabbit

  • Bug Fixes
    • Improved handling of missing or empty source metadata so filename extraction no longer fails; file links are now included only when a valid source exists, while chunk links remain stable.
  • Tests
    • Added regression tests covering missing, empty, and present source metadata, URL encoding of filenames, and metadata passthrough.

Review Change Stack

doc_metadata.get('source') returns None for older partitions, web-search
results, or chunks from loaders that don't set the field. Path(None)
raises TypeError, which surfaces as a 500 and takes down the entire
chat completion response. Default the missing key to an empty string
before constructing the Path.

Fixes #370
@coderabbitai

coderabbitai Bot commented May 20, 2026

Copy link
Copy Markdown
📝 Walkthrough

Walkthrough

Delegate document link construction to a new stdlib-only helper build_document_source_link; router supplies static_url/chunk_url callables and no longer performs inline Path/quote/url logic. Add tests covering missing, None, empty, and present source cases.

Changes

Source Metadata Defensive Handling

Layer / File(s) Summary
Source link helper (contract and implementation)
openrag/routers/source_links.py
Adds build_document_source_link(doc_metadata, static_url_builder, chunk_url_builder) that derives an optional percent-encoded file_url from source basename, always sets chunk_url from _id, and merges with original metadata.
Router delegates to helper
openrag/routers/openai.py
Removes inline Path/quote/url construction in __prepare_sources; defines static_url and chunk_url callables and calls build_document_source_link for each document. Also trims unused imports.
Regression tests for source handling
openrag/routers/test_openai_prepare_sources.py
Adds pytest helpers and tests validating that missing/None/empty source omit file_url, present source yields encoded basename file_url, chunk_url is stable from _id, and other metadata passes through.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

A rabbit nibbling at code so neat,
Turns None to blank and spares the fleet.
Links now built by a tidy hand,
Tests hop in to guard the land. 🐇✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 7.69% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately describes the main fix: guarding Path(None) when chunk metadata lacks the source key, addressing issue #370.
Linked Issues check ✅ Passed The PR fully addresses issue #370 by extracting source-link building logic into build_document_source_link, handling missing/None/empty source values, and including comprehensive tests.
Out of Scope Changes check ✅ Passed All changes align with issue #370: refactoring openai.py, creating source_links.py helper module, and adding corresponding tests. No out-of-scope modifications detected.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/370-source-path-none-guard

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@openrag/routers/openai.py`:
- Around line 107-109: The code currently always computes file_url from source
even when source/filename is empty; change the logic around doc_metadata
handling (look for variables source, filename, file_url) so that you only call
request.url_for("static", path=filename) and set file_url when filename is
truthy (non-empty); if filename is falsy, do not include file_url in the
returned metadata/dict (omit the key or leave it undefined) to avoid emitting a
static URL with an empty path.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: df766678-5788-4763-b04c-7cb5dd7b6e54

📥 Commits

Reviewing files that changed from the base of the PR and between 470c30e and f78d3b4.

📒 Files selected for processing (1)
  • openrag/routers/openai.py

Comment thread openrag/routers/openai.py Outdated

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@openrag/routers/test_openai_prepare_sources.py`:
- Around line 18-31: The tests currently duplicate logic in
_prepare_sources_filename and _emits_file_url which masks regressions; update
the tests to exercise the real router logic by calling the production function
__prepare_sources in openrag/routers/openai.py (or import and use the shared
helper exported by that module) instead of the local mirror helpers, and replace
assertions that use _prepare_sources_filename/_emits_file_url with assertions on
the actual __prepare_sources return values (and its emitted file_url behavior)
so the test verifies real behavior rather than duplicated logic.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 48bb4704-9049-47b7-a3a1-f5d6009db6ee

📥 Commits

Reviewing files that changed from the base of the PR and between f78d3b4 and 1270952.

📒 Files selected for processing (2)
  • openrag/routers/openai.py
  • openrag/routers/test_openai_prepare_sources.py

Comment thread openrag/routers/test_openai_prepare_sources.py Outdated
…helper

Move the per-document link assembly out of __prepare_sources into a
stdlib-only routers/source_links.build_document_source_link. The
regression test now exercises the real helper instead of a mirrored
copy, so it catches production drift. The router passes request-bound
url builders to keep request.url_for usage in the HTTP layer.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (3)
openrag/routers/source_links.py (2)

34-39: 💤 Low value

**doc_metadata can silently override computed keys.

Because **doc_metadata is spread after source_type, file_url, and chunk_url, any chunk metadata that happens to carry one of those keys (e.g., a stale chunk_url, a source_type injected upstream, or a precomputed file_url) will overwrite the values just computed here. Today this is mostly theoretical, but it makes the helper’s contract fragile — a future loader adding one of those reserved keys to metadata would silently break the response shape without any test failing.

Consider spreading doc_metadata first so the computed keys win, or popping the reserved keys before merging.

♻️ Proposed refactor
     return {
+        **doc_metadata,
         "source_type": "document",
         **({"file_url": encoded_url} if encoded_url else {}),
         "chunk_url": chunk_url_builder(doc_metadata["_id"]),
-        **doc_metadata,
     }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@openrag/routers/source_links.py` around lines 34 - 39, The returned dict
currently spreads doc_metadata last so any reserved keys in doc_metadata (e.g.,
"source_type", "file_url", "chunk_url") can silently override computed values;
update the logic in the function that builds the return (references:
doc_metadata, encoded_url, chunk_url_builder) to either spread doc_metadata
first and then explicitly set "source_type", "file_url" (when encoded_url), and
"chunk_url" so computed keys win, or explicitly remove/pop those reserved keys
from doc_metadata before merging; ensure the final object always uses the
computed chunk_url_builder(doc_metadata["_id"]) and computed encoded_url-derived
file_url.

37-37: 💤 Low value

doc_metadata["_id"] will raise KeyError on malformed input.

The docstring documents _id as required, which is fine for the production callers, but the helper is now stdlib-only and unit-testable, so a misformed dict will surface as a bare KeyError rather than something the caller can distinguish from a real bug. Since the same defensive treatment that motivated this PR (source"") was applied here, a symmetric guard (e.g. raise a clear ValueError, or accept missing _id and omit chunk_url) would make the contract explicit and easier to reason about.

Not a blocker — only worth doing if you want the helper to be strict about its contract.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@openrag/routers/source_links.py` at line 37, The current code calls
chunk_url_builder(doc_metadata["_id"]) which will raise a bare KeyError for
malformed input; update the helper to explicitly check for the "_id" key before
using it and either (a) raise a clear ValueError like "missing required '_id' in
doc_metadata" when "_id" is absent, or (b) skip/omit the "chunk_url" entry if
you prefer permissive behavior — implement the chosen behavior in the code that
builds the dict (the place that currently has "chunk_url":
chunk_url_builder(doc_metadata["_id"])) so callers get an explicit contract
instead of a KeyError.
openrag/routers/test_openai_prepare_sources.py (1)

47-55: 💤 Low value

Encoding coverage is thin — consider one more character class.

test_file_url_emitted_and_encoded_when_source_present only verifies space encoding, which is the easy case. A filename containing characters that quote(safe=":/") won't encode (e.g. : or / appearing in the basename via something exotic) vs. characters that it will (e.g. #, ?, %, non-ASCII like é) would lock in the safe=":/" choice against accidental changes. Worth one extra parametrized case so the encoding contract is actually pinned down.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@openrag/routers/test_openai_prepare_sources.py` around lines 47 - 55, Add a
new parametrized case alongside
test_file_url_emitted_and_encoded_when_source_present that uses _build to create
links with basenames containing characters that must be percent-encoded (e.g.
"#", "?", "%", non-ASCII like "é") and assert the produced link["file_url"]
encodes those characters (so the test pins the behavior of quote(safe=":/")).
Alternatively, add a new test function (e.g.
test_file_url_encodes_reserved_and_non_ascii_chars) that calls _build with a
source like "/srv/data/fi#le?.é.pdf" and asserts the expected percent-encoded
basename appears in "https://host/static/…".
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@openrag/routers/source_links.py`:
- Around line 34-39: The returned dict currently spreads doc_metadata last so
any reserved keys in doc_metadata (e.g., "source_type", "file_url", "chunk_url")
can silently override computed values; update the logic in the function that
builds the return (references: doc_metadata, encoded_url, chunk_url_builder) to
either spread doc_metadata first and then explicitly set "source_type",
"file_url" (when encoded_url), and "chunk_url" so computed keys win, or
explicitly remove/pop those reserved keys from doc_metadata before merging;
ensure the final object always uses the computed
chunk_url_builder(doc_metadata["_id"]) and computed encoded_url-derived
file_url.
- Line 37: The current code calls chunk_url_builder(doc_metadata["_id"]) which
will raise a bare KeyError for malformed input; update the helper to explicitly
check for the "_id" key before using it and either (a) raise a clear ValueError
like "missing required '_id' in doc_metadata" when "_id" is absent, or (b)
skip/omit the "chunk_url" entry if you prefer permissive behavior — implement
the chosen behavior in the code that builds the dict (the place that currently
has "chunk_url": chunk_url_builder(doc_metadata["_id"])) so callers get an
explicit contract instead of a KeyError.

In `@openrag/routers/test_openai_prepare_sources.py`:
- Around line 47-55: Add a new parametrized case alongside
test_file_url_emitted_and_encoded_when_source_present that uses _build to create
links with basenames containing characters that must be percent-encoded (e.g.
"#", "?", "%", non-ASCII like "é") and assert the produced link["file_url"]
encodes those characters (so the test pins the behavior of quote(safe=":/")).
Alternatively, add a new test function (e.g.
test_file_url_encodes_reserved_and_non_ascii_chars) that calls _build with a
source like "/srv/data/fi#le?.é.pdf" and asserts the expected percent-encoded
basename appears in "https://host/static/…".

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 8c3571e0-e2ae-4e88-a2b0-d266f12f02a4

📥 Commits

Reviewing files that changed from the base of the PR and between 1270952 and 7f149d6.

📒 Files selected for processing (3)
  • openrag/routers/openai.py
  • openrag/routers/source_links.py
  • openrag/routers/test_openai_prepare_sources.py

@EnjoyBacon7 EnjoyBacon7 merged commit 6e2e071 into refactor/hexagonal May 21, 2026
6 checks passed
@EnjoyBacon7 EnjoyBacon7 deleted the fix/370-source-path-none-guard branch May 21, 2026 12:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant