fix(embedder): plumb max_model_len into model-endpoint embedders (default 2047)#587
fix(embedder): plumb max_model_len into model-endpoint embedders (default 2047)#587Ahmath-Gadji wants to merge 1 commit into
Conversation
…ault 2047) The DB-backed model-endpoint config only carries endpoint/model_name/ batch_size/timeout/extra, so both the API container factory and the indexer's own factory built embedders with max_model_len=None. That disables the `truncate_prompt_tokens` guard, and pooling models (e.g. Qwen3-Embedding) then 400/hang on context-boundary inputs (vllm-project/vllm#29496) — even when MAX_MODEL_LEN is set, because that only feeds the static embedder config. - Backfill max_model_len/embed_concurrency from the static `embedder` settings in both factories (di/container.py and services/workers/indexer_pool.py); an explicit per-endpoint `extra` value still wins. - Default max_model_len 8192 -> 2047 (one below the common 2048 boundary) so a fresh deployment is safe out of the box. - Add a minimal DEBUG embedder-construction log, plus a WARNING when max_model_len is unset or when a batched embed fails.
📝 WalkthroughWalkthroughEmbedder configuration now defaults max_model_len to 2047 with validation, the factory wiring backfills missing embedder kwargs from global settings, and VLLMEmbedder logs startup parameters plus batch-failure context. Tests cover the backfill behavior and override precedence. ChangesEmbedder defaults and backfill
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related PRs
Suggested labels
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
🧹 Nitpick comments (1)
tests/unit/di/test_container.py (1)
571-578: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winCover
embed_concurrencyoverride precedence too.Line 571 only locks down the explicit-override path for
max_model_len. Both factories backfill two keys, so a regression where endpointextra["embed_concurrency"]stops winning would still pass these new tests.Suggested test extension
def test_embedder_extra_overrides_backfilled_max_model_len(self): """A per-endpoint ``extra.max_model_len`` wins over the settings default.""" settings = _settings_with_named_models() settings.models.embedder["embed-a"].extra["max_model_len"] = 4096 + settings.models.embedder["embed-a"].extra["embed_concurrency"] = 9 c = ServiceContainer(settings) - assert c.embedder_factory("embed-a").kwargs["max_model_len"] == 4096 + embedder = c.embedder_factory("embed-a") + assert embedder.kwargs["max_model_len"] == 4096 + assert embedder.kwargs["embed_concurrency"] == 9🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/unit/di/test_container.py` around lines 571 - 578, The embedder factory test currently only verifies that per-endpoint extra.max_model_len overrides the default, but it does not cover the same precedence for embed_concurrency. Extend test_embedder_extra_overrides_backfilled_max_model_len in test_container.py to also assert that ServiceContainer.embedder_factory("embed-a").kwargs["embed_concurrency"] honors settings.models.embedder["embed-a"].extra["embed_concurrency"] over any backfilled default, matching the precedence already exercised for max_model_len.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@tests/unit/di/test_container.py`:
- Around line 571-578: The embedder factory test currently only verifies that
per-endpoint extra.max_model_len overrides the default, but it does not cover
the same precedence for embed_concurrency. Extend
test_embedder_extra_overrides_backfilled_max_model_len in test_container.py to
also assert that
ServiceContainer.embedder_factory("embed-a").kwargs["embed_concurrency"] honors
settings.models.embedder["embed-a"].extra["embed_concurrency"] over any
backfilled default, matching the precedence already exercised for max_model_len.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 37ad303f-ee81-47b0-8def-7f4ff09616c6
📒 Files selected for processing (7)
conf/config.yamlopenrag/core/config/endpoints.pyopenrag/di/container.pyopenrag/services/inference/vllm_client.pyopenrag/services/workers/indexer_pool.pytests/unit/di/test_container.pytests/unit/services/workers/test_indexer_pool.py
|
@codex review |
|
Codex Review: Didn't find any major issues. You're on a roll. Reviewed commit: ℹ️ About Codex in GitHubYour team has set up Codex to review pull requests in this repo. Reviews are triggered when you
If Codex has suggestions, it will comment; otherwise it will react with 👍. Codex can also answer questions or update the PR. Try commenting "@codex address that feedback". |
Problem
The indexer (and API) embedders are built from the DB-backed model-endpoint config, whose
extracarries onlyimplementation+api_key.max_model_lenis dropped, soVLLMEmbedderis constructed withmax_model_len=None. That disables thetruncate_prompt_tokensguard, and pooling models such as Qwen3-Embedding then 400 / hang on context-boundary inputs (vllm-project/vllm#29496).Setting
MAX_MODEL_LENdid not help: that env var only feeds the staticembeddersettings, not the named model-endpoint embedders the indexer/API actually use. The indexer also has its own factory (_build_embedder_factory), separate from the API container'smake_component_factory, so each path dropped the value independently.Observed in the wild: a litellm-proxied endpoint returned
ContextWindowExceededError: This model's maximum context length is 2048 tokens. However, you requested 1000003 tokens. Sendingtruncate_prompt_tokensfixes it.Changes
max_model_len/embed_concurrencyfrom the staticembeddersettings in both factories —di/container.py(_embedder_extra_kwargs) andservices/workers/indexer_pool.py(_build_embedder_factory). An explicit per-endpointextravalue still wins, so it stays presetable per endpoint.max_model_len8192 → 2047(one below the common 2048 boundary) so a fresh deployment is safe out of the box.max_model_len), a WARNING whenmax_model_lenis unset, and a WARNING when a batched embed fails (with how many batches completed). No per-batch / INFO noise.Tests
test_container.py: factory backfills the two kwargs; newtest_embedder_extra_overrides_backfilled_max_model_len(explicitextrawins).test_indexer_pool.py: newtest_embedder_factory_backfills_max_model_len_from_settings(backfill + explicit-wins).extra.max_model_len=1000indexes withtruncate_prompt_tokens=999.Base branch:
refactor/hexagonal.Summary by CodeRabbit
Bug Fixes
Tests