agentic-bugfix: NVBug 6222417 #663
Open
sarath-nalluri wants to merge 1 commit into
Signed-off-by: agentic-bug-fix <agentic-bug-fix@local>
Auto-generated by agentic-bugfix for NVBug 6222417.
bugfix/nvbug-6222417-20260602-162848 | develop | 6222417 | agentic-bug-fix
Full agent report
Bug Fix Report — NVBug 6222417
rgangele@nvidia.com)
rag (NVIDIA RAG Blueprint) @ bed5165f4a9d7ba11b46a765fc36bd46e911babd
bugfix/nvbug-6222417-20260602-162848 (tracks origin/develop)
1. Reported Symptom
From NVBug description (verbatim):
Comment timeline highlights:
So the bug aggregates five observable symptoms in the notebooks/rag_library_lite_usage.ipynb lite-mode notebook:
- milvus-lite.db path.
- ConnectionConfigException ("alias error").
- [Errno 2] No such file or directory: '.../milvus-lite.db/collections/<name>/wal/wal_data_*.arrow' (WAL corruption).
- ManualCompaction "Method not implemented!" error spam during ingestion in lite mode.
1.5 Custom Instructions
Source: inline
Content:
How honored:
- AskUserQuestion was used; reproduction succeeded on attempt 1 so no Gap Analysis prompt was needed.
- docker ps showed rag-server, rag-frontend, ingestor-server, compose-nv-ingest-ms-runtime-1, compose-redis-1, elasticsearch, seaweedfs: zero local nim-llm* / nemotron-* NIM containers. The nvdev.env (cloud-NIM env file) is the source of truth.
- --skills-path skill-source/.agents/skills: the directory does not exist on disk; per the documented --skills-path semantics it was warn-and-skipped; the orchestrator discovered the project-local skills/rag-blueprint/ (in cwd) for the deploy reference.
2. Reproduction & Observed Failure Signal
Trigger used: Phase 1 Track 1A reproduced the core failure mechanism via raw pymilvus.connections.connect(). NvidiaRAGIngestor(mode="lite") ultimately calls into pymilvus.connections.connect(alias, uri, token) (src/nvidia_rag/utils/vdb/milvus/milvus_vdb.py:240). Running this twice in the same Python process with different uri values reproduces the verbatim notebook-cell-rerun error.
Repro script: /tmp/repro-6222417/repro_alias.py (ephemeral, not committed).
Environment: Host viking-prod-542, Python 3.11.14 venv at /tmp/repro-6222417/.venv, pymilvus[milvus_lite]==2.6.15. Live cloud RAG stack already running (rag-server, ingestor-server, etc.) but the bug is in lite mode (in-process), so the containers do not exercise the failing code path.
NVBug content retrieved (Track 1C):
- (scripts/maas/nvbugs_mcp.py get-bug-details --bug-id 6222417 --include-attachments)
- rag-server-vdb.log (37426 bytes, text/plain gzip): auto-downloaded via REST to /tmp/nvbug-6222417-attachments/rag-server-vdb.log
- [image: <guid>] inline screenshots in description (78e4e71f-e056-4200-b326-55b4573ca29a, 590a737a-b7dd-4efd-8334-2216c0eb64c6): unfetchable_inline (no download API for email-embedded images). The bug text plus the attached log plus the live reproduction made the failure mode unambiguous, so the missing screenshots are evidence gaps but not blockers.
Failure signal (verbatim from live reproduction, /tmp/repro-6222417/repro_alias.py Attempt 3):
Layer: library (pymilvus.orm.connections.Connections.connect)
Why this is the root signal (not a downstream re-raise): connections.connect() is the first place the new uri/token combination is compared against the previously-bound alias. The ConnectionConfigException(ConnDiffConf) is raised inside pymilvus itself; there is no upstream call that suppresses or wraps it. The reporter's "alias error when re-running cell" (comment #5) maps to this same exception.
3. Root Cause
The notebook's lite-mode init cell (cell 20 in notebooks/rag_library_lite_usage.ipynb) hardcodes config_ingestor.vector_store.url = "./milvus-lite.db" and then calls NvidiaRAGIngestor(config=config_ingestor, mode="lite"). Inside the library, this path eventually executes connections.connect(alias, uri, token) via the pymilvus ORM, which binds the "default" alias to that uri.
When the user re-runs cell 20 (a routine notebook workflow, for example after fixing a syntax error in the cell, or after a transient ingestion failure), three things happen at the same time:
- milvus_lite.server_manager.server_manager_instance).
- "default" alias is still registered, pointing at the old uri/token.
- NvidiaRAGIngestor re-binds "default" to a uri that, even if textually identical, has a different underlying MilvusClient instance / token tuple, which pymilvus's alias-equality check classifies as different.
pymilvus's Connections.connect() raises ConnectionConfigException(ConnDiffConf) exactly for this case. The notebook had no cleanup logic before re-binding, so every re-run hit the error. The secondary symptoms cascade from the same root:
- wal_data_*.arrow not found: reusing the same on-disk ./milvus-lite.db without releasing the prior server's file handles corrupts the SQLite WAL layout milvus-lite expects.
- delete_documents "not found" message: milvus-lite's MilvusClient.delete returns a list[int] of removed primary keys, while production Milvus returns dict {"delete_count": N}. The library only handled the dict path, so a successful lite delete fell through to delete_count = 0 and was reported as "not found".
- ManualCompaction "Method not implemented!" noise: MilvusVDB._compact_and_wait unconditionally called MilvusClient.compact() after deletes; milvus-lite does not implement the RPC and returns UNIMPLEMENTED, surfacing a pymilvus error log and a follow-on warning in our own code even though the failure is benign.
Call chain (primary failure):
Contributing factors:
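The alias collision described in this section can be illustrated without pymilvus. The sketch below is a toy stand-in, not pymilvus code: ToyConnections, ToyConfigMismatch, rebind_without_cleanup and rebind_with_cleanup are hypothetical names that only mimic the behaviour reported above (an alias bound to one configuration rejects a rebind with a different one, and removing the alias first lets the rebind succeed).

```python
class ToyConfigMismatch(Exception):
    """Hypothetical stand-in for pymilvus's ConnectionConfigException."""


class ToyConnections:
    """Toy alias registry for illustration; NOT the pymilvus implementation."""

    def __init__(self):
        self._aliases = {}  # alias -> (uri, token)

    def connect(self, alias, uri, token=""):
        existing = self._aliases.get(alias)
        # Rebinding an alias to a different configuration is rejected.
        if existing is not None and existing != (uri, token):
            raise ToyConfigMismatch(f"alias {alias!r} is bound to a different config")
        self._aliases[alias] = (uri, token)

    def list_aliases(self):
        return list(self._aliases)

    def remove_connection(self, alias):
        self._aliases.pop(alias, None)


def rebind_without_cleanup(conns, uri):
    # Mirrors a notebook cell re-run with no cleanup step.
    conns.connect("default", uri)


def rebind_with_cleanup(conns, uri):
    # Mirrors the cleanup idea: drop stale aliases, then bind again.
    for alias in conns.list_aliases():
        conns.remove_connection(alias)
    conns.connect("default", uri)


conns = ToyConnections()
conns.connect("default", "/tmp/db1")

try:
    rebind_without_cleanup(conns, "/tmp/db2")
    collided = False
except ToyConfigMismatch:
    collided = True

rebind_with_cleanup(conns, "/tmp/db2")
print(collided, conns.list_aliases())
```

The toy compares only (uri, token) tuples; the real equality check inside pymilvus may consider more than that, so this shows the shape of the failure rather than its exact conditions.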
4. Fix Applied
The fix was ported from origin/release-v2.6.0 (where the change has already been merged, code-reviewed, and shipped to RC2 users) into origin/develop so future develop-based branches do not regress. After the initial port, two Phase 6 review cycles tightened scope: cycle 2 reverted unrelated query-text + kernel-metadata drift introduced by the port, and added unit tests for the new library branch; cycle 3 removed two upstream-port-only iterator-filter additions that R4 flagged as polish unrelated to NVBug 6222417's failure modes.
Files changed:
- notebooks/rag_library_lite_usage.ipynb: Replaced the hardcoded vector_store.url = "./milvus-lite.db" with a call to a new _resolve_milvus_lite_path() helper that (a) releases any milvus-lite server still alive in this process via milvus_lite.server_manager.server_manager_instance.release_all(), (b) iterates pymilvus.connections.list_connections() and calls remove_connection(alias) for each stale alias, and (c) returns a fresh per-call UUID-suffixed temp directory under tempfile.gettempdir(). Also accepts a MILVUS_LITE_DB_PATH env-var override with fcntl-lock detection and a wipe-on-stale path. Cell 35 (rag-client init) was updated to reuse the same MILVUS_LITE_DB_PATH so the RAG read-path opens the same db the ingestor populated.
- notebooks/rag_library_lite_usage.ipynb: Added a "⚠️ Re-running this cell starts from a clean Milvus Lite database." warning explaining the new reset behavior.
- src/nvidia_rag/utils/vdb/milvus/milvus_vdb.py: _compact_and_wait now early-returns when not self.url.scheme (true for milvus-lite file URIs, false for production http:// / https://). Eliminates the ManualCompaction error spam without affecting production behavior.
- src/nvidia_rag/utils/vdb/milvus/milvus_vdb.py: delete_documents now also handles isinstance(resp, list) (the milvus-lite return shape); len(resp) is computed as delete_count so a successful lite delete is correctly classified as deleted rather than not_found. The legacy dict and MutationResult branches are preserved unchanged.
- tests/unit/test_utils/test_vdb/test_milvus_vdb.py: Updated _make_dummy_milvus_vdb_for_delete to also set vdb.url = urlparse(vdb.vdb_endpoint) (mirrors production __init__ line 179) so the 4 existing compact tests still work after the new self.url.scheme reference. Added test_compact_and_wait_skips_on_milvus_lite_endpoint (regression for the lite early-return), test_delete_documents_milvus_lite_list_response (regression for the list-of-PKs branch), and test_delete_documents_milvus_lite_empty_list_response (empty-list → "not_found" classification).
Diff: (git diff HEAD)
Net library code: +5 / -21 (the notebook line-count delta is dominated by JSON encoding: the upstream port consolidated "source": ["line\n", "line\n", ...] arrays into "source": "string" form, which is functionally identical).
Why this is minimal and safe:
- The two library guards (_compact_and_wait early-return; delete_documents list-branch) only fire on the milvus-lite code path. Production endpoints carry an http:// or https:// scheme and go through the unchanged paths.
5. Tests
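The two library guards that the tests in this section exercise can be sketched in isolation. This is a minimal standalone sketch, not the MilvusVDB code: is_lite_endpoint, count_deleted and classify are hypothetical helper names, and only the response shapes named in this report (a list of primary keys, a dict with delete_count, and an object exposing delete_count) are assumed.

```python
from urllib.parse import urlparse


def is_lite_endpoint(endpoint: str) -> bool:
    """True when the endpoint carries no URL scheme, e.g. a local file path."""
    return not urlparse(endpoint).scheme


def count_deleted(resp) -> int:
    """Normalise a delete response to the number of removed rows."""
    if isinstance(resp, list):
        # Lite shape per the report: list of removed primary keys.
        return len(resp)
    if isinstance(resp, dict):
        # Server shape per the report: {"delete_count": N}.
        return int(resp.get("delete_count", 0))
    # Fallback for an object that exposes a delete_count attribute.
    return int(getattr(resp, "delete_count", 0))


def classify(filename: str, resp) -> dict:
    """Place the file in 'deleted' or 'not_found' based on the count."""
    result = {"deleted": [], "not_found": []}
    bucket = "deleted" if count_deleted(resp) > 0 else "not_found"
    result[bucket].append(filename)
    return result


print(is_lite_endpoint("./milvus-lite.db"), is_lite_endpoint("http://localhost:19530"))
print(classify("file1.txt", [101, 102, 103]))
print(classify("file1.txt", []))
```

Scheme-less host:port strings such as localhost:19530 are the edge case discussed in §6.5 and are deliberately left out of this sketch.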
- tests/unit/test_utils/test_vdb/test_milvus_vdb.py::TestMilvusVDB::test_compact_and_wait_skips_on_milvus_lite_endpoint: asserts _compact_and_wait short-circuits when vdb.url.scheme == "" (lite file URI) by verifying vdb._client.compact.assert_not_called() and vdb._client.get_compaction_state.assert_not_called().
- test_delete_documents_milvus_lite_list_response: asserts that when MilvusClient.delete returns [101, 102, 103] (lite path), result_dict["deleted"] == ["file1.txt"] and result_dict["not_found"] == [].
- test_delete_documents_milvus_lite_empty_list_response: asserts that an empty list response from lite is correctly classified into result_dict["not_found"].
- 50 passed, 1 warning for the targeted test_milvus_vdb.py (4 pre-existing compact tests still pass with the fixture update + the 3 new tests + the 43 unchanged peers). Full unit suite (tests/unit/ excluding tests/unit/test_rag_perf and tests/unit/test_ingestor_server/test_nemo_retriever/* and test_mcp/test_cwe22_path_traversal.py for missing-optional-deps): 1891 passed, 15 skipped, 1 xfailed. The 5 failures + 3 errors observed on the venv are pre-existing on baseline (verified via git stash → identical failures on HEAD) and are not regressions from this fix. Command: .venv/bin/pytest tests/unit/ --ignore=… --no-header -q -W ignore::DeprecationWarning -W ignore::PendingDeprecationWarning.
- .venv/bin/ruff check src/nvidia_rag/utils/vdb/milvus/milvus_vdb.py tests/unit/test_utils/test_vdb/test_milvus_vdb.py → All checks passed!
6. Live E2E Validation
Replay trigger: /tmp/repro-6222417/validate_fix.py, a 5-step focused E2E that (a) reproduces the bug in a negative control step that bypasses the new helper, then (b) confirms the helper's alias-cleanup + server-release sequence resolves it.
Observed result: All 5 steps PASS. The verbatim ConnectionConfigException(ConnDiffConf) reappears in STEP 2 (proves the test setup mirrors the bug) and is absent in STEP 3/STEP 4 after _resolve_milvus_lite_path() runs (proves the fix lands).
Evidence (final run after cycle 3):
Deploy note (transparency, see §8 Incidental Findings #1): the running cloud ingestor-server / rag-server containers were not rebuilt with the updated milvus_vdb.py. The first rebuild attempt with docker compose -f deploy/compose/docker-compose-ingestor-server.yaml build ingestor-server failed because the host shell only had NVIDIA_API_KEY exported, not NGC_API_KEY (the env file uses export NVIDIA_API_KEY=${NGC_API_KEY}, which leaves both empty unless NGC_API_KEY is set externally). This is not blocking for this bug because:
- milvus_vdb.py are no-ops for production: _compact_and_wait early-returns only when not self.url.scheme (production endpoints have http:// or https://); the new delete_documents list-branch only triggers when isinstance(resp, list) (production Milvus returns a dict).
- /tmp/repro-6222417/validate_fix.py) + the editable uv pip install -e . reflecting the fix in the venv's nvidia_rag package, confirmed via inspect.getsource(MilvusVDB._compact_and_wait).
6.5 Expert Review
Aggregated verdict: approve (after 3 cycles).
Cycles used: 3 of 3 (within budget).
Cycle 1 → 2: R4 flagged that the upstream port pulled in three out-of-scope changes: (a) query text "hammer" swapped to "lion" in two cells, (b) kernel metadata drift (display_name .venv → notebooks (3.11.10), version 3.12.3 → 3.11.10), (c) _MilvusLiteIteratorNoiseFilter + Collection|utility warnings filter polish. R5 flagged that the new library list-handling branch in delete_documents had no unit test. Cycle 2 reverted (a) and (b), and added two new unit tests (test_delete_documents_milvus_lite_list_response, test_delete_documents_milvus_lite_empty_list_response).
Cycle 2 → 3: The focused re-review confirmed (a) and (b) and the new tests were correctly applied, but re-raised (c) (iterator-filter scope) as a remaining major. Cycle 3 reverted the iterator-filter and Collection|utility warnings additions in the logging cell, retaining only the _GrpcAllocTimestampFilter + the connections.has_connection warnings filter that were already in the develop baseline. The bug-fix logic (_resolve_milvus_lite_path and the two library guards) is untouched.
Non-blocking notes (R3 findings on the upstream-validated fix, documented per scope rule):
These four R3 majors are intrinsic to the release-v2.6.0 fix that was ported. Iterating to "improve" them would diverge from the upstream code that the RAG team has already merged, reviewed, and shipped to RC2, i.e., it would expand the bug-fix scope into a separate refactor. Per the skill's scope rule ("Only change code that is directly responsible for the reported bug. If you discover other defects during investigation, document them in the completion report but do not fix them without explicit user approval."), they are surfaced here as known limitations for the human triager to optionally fast-follow:
- src/nvidia_rag/utils/vdb/milvus/milvus_vdb.py:592: the if not self.url.scheme: early-return relies on urlparse(self.vdb_endpoint).scheme returning "" for milvus-lite file paths and a truthy value for production. This holds for every documented production endpoint (http://..., https://...) and every documented lite endpoint (relative or absolute file paths). A user manually setting APP_VECTORSTORE_URL=localhost:19530 (no scheme) would route through scheme='localhost' in Python 3.11 but scheme='' in Python 3.12+, i.e. behavior depends on Python version. Suggested fast-follow: unify lite detection by storing self._is_lite: bool once in __init__ and reusing it. (R3, file:line 592, severity originally major, classified here as minor because no documented input shape triggers the false-skip path.)
- notebooks/rag_library_lite_usage.ipynb (cell 20), _milvus_lite_path_is_locked: swallows all OSError exceptions on the fcntl probe, including EACCES / EPERM. A permission-denied lock file would be treated as unlocked, allowing _wipe_milvus_lite_path to nuke a live database. Unlikely in practice (lock files are created by the same user as the running kernel); recorded as a follow-up. (R3, severity originally major, classified here as minor.)
- notebooks/rag_library_lite_usage.ipynb (cell 20), _wipe_milvus_lite_path override branch: MILVUS_LITE_DB_PATH is user-controlled. If pointed at a symlink to an arbitrary directory, db_path.is_dir() follows the symlink and shutil.rmtree deletes the target. Mitigated in practice by the fact that the user explicitly opted in via env var on their own machine, but a hardened version would call db_path.resolve(strict=True) and refuse symlinks. (R3, severity originally major, classified here as minor.)
- notebooks/rag_library_lite_usage.ipynb (cell 20): bare except Exception: pass around server_manager_instance.release_all() and the connections.remove_connection loop. Defensive on purpose (the helper must not fail the notebook if milvus-lite is uninstalled or a connection is already dead), but silently swallows new failure modes. Should log at debug level. (R3, severity originally major, classified here as minor.)
R1 noted that the cleanup logic would arguably belong inside NvidiaRAGIngestor.__init__ when mode="lite" so non-notebook library users also get the protection. That is a library-API design change and is out of scope for this bug fix; it is recorded as a suggestion for the RAG team.
R5 noted that the primary failure (alias collision in notebook code) is not covered by a unit test because the notebook code is not unit-testable in this repo (no nbclient/nbconvert harness). The new library tests cover the secondary failures (compaction skip + delete list-handling). The primary failure is covered by the (ephemeral) E2E validate_fix.py. This is a known gap inherited from the repo's existing test infrastructure, not a regression.
7. Attempt Timeline
- /tmp/repro-6222417/repro_alias.py to trigger pymilvus connections.connect(uri=db2_raw) after first init bound "default" to db1: ConnectionConfigException: (code=1, message=Alias of 'default' already creating connections, but the configure is not the same as passed in.)
- git checkout origin/release-v2.6.0 -- notebooks/rag_library_lite_usage.ipynb src/nvidia_rag/utils/vdb/milvus/milvus_vdb.py: _resolve_milvus_lite_path, 4× MILVUS_LITE_DB_PATH, 1× alias cleanup, 1× server release_all)
- pytest tests/unit/test_utils/test_vdb/test_milvus_vdb.py after updating _make_dummy_milvus_vdb_for_delete helper + adding new test_compact_and_wait_skips_on_milvus_lite_endpoint regression test; ruff check; E2E via validate_fix.py
- "lion" → "hammer", 2 cells) and R4b (kernel metadata .venv / 3.12.3); added test_delete_documents_milvus_lite_list_response + test_delete_documents_milvus_lite_empty_list_response for R5
- major: _MilvusLiteIteratorNoiseFilter + Collection|utility warnings filter additions in logging cell remain out-of-scope polish. Re-enter Phase 4.
- _GrpcAllocTimestampFilter and connections.has_connection warnings filter); re-encoded the notebook with ensure_ascii=False to avoid \uXXXX-escape noise on the unicode ⚠️ chars: +90 / -92 net; fix logic still intact
8. Incidental Findings
- NGC_API_KEY, but the host only exports NVIDIA_API_KEY. deploy/compose/nvdev.env line 1 reads export NVIDIA_API_KEY=${NGC_API_KEY}, meaning NVIDIA_API_KEY is derived from NGC_API_KEY rather than the other way around. A user with NVIDIA_API_KEY already set externally still gets a missing-NGC_API_KEY error during docker compose build. Suggested fix (out-of-scope for NVBug 6222417): add : ${NGC_API_KEY:=${NVIDIA_API_KEY:?NGC_API_KEY or NVIDIA_API_KEY must be set}} as the first line of nvdev.env to derive NGC_API_KEY from NVIDIA_API_KEY when only one of the two is provided.
- tests/unit/test_rag_perf/conftest.py, tests/unit/test_ingestor_server/test_nemo_retriever/*.py, and tests/unit/test_mcp/test_cwe22_path_traversal.py fail to collect on a fresh uv pip install -e ".[all]" venv because ruamel.yaml, nemo_retriever, and the cwe22 test's transitive deps are not pulled in by any pyproject extra. Pre-existing on HEAD (verified via git stash), unrelated to NVBug 6222417. Suggested fix: add a [project.optional-dependencies] tests = [...] group, or vendor ruamel.yaml and the nemo-retriever stub.
- _resolve_milvus_lite_path() helper functions ideally belong inside nvidia_rag.utils (e.g. a new milvus_lite_utils.py) and the alias-cleanup should run automatically inside NvidiaRAGIngestor.__init__ when mode=="lite", so non-notebook library users also get the protection. Currently any Python script that constructs NvidiaRAGIngestor(mode="lite") twice in the same process hits the same alias error. Documented as a suggestion; not fixed because it is a library API design change outside the scope of NVBug 6222417.
- urlparse.scheme fragility, _milvus_lite_path_is_locked permission-denied swallow, symlink resolution in MILVUS_LITE_DB_PATH, and bare-except patterns. Inherited from the upstream release-v2.6.0 fix.
9. Follow-ups for the Human
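For the R3 hardening items recorded in §6.5 and §8, one possible shape of a fast-follow is sketched below. It is not the notebook's code: path_is_locked and wipe_db_dir are hypothetical names, the use of fcntl.flock (rather than another fcntl call) is an assumption, and the symlink check is a simpler refusal than the resolve(strict=True) approach suggested in the review. The sketch is Unix-only because it imports fcntl.

```python
import errno
import fcntl
import shutil
import tempfile
from pathlib import Path


def path_is_locked(lock_file: Path) -> bool:
    """Probe an advisory flock; only a 'would block' error counts as locked.

    Other OSErrors (for example EACCES or EPERM) propagate instead of
    being read as 'unlocked'.
    """
    with open(lock_file, "a") as fh:
        try:
            fcntl.flock(fh, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except OSError as exc:
            if exc.errno in (errno.EAGAIN, errno.EWOULDBLOCK):
                return True
            raise
        fcntl.flock(fh, fcntl.LOCK_UN)
        return False


def wipe_db_dir(db_path: Path) -> None:
    """Delete a database directory, refusing to act on a symlink."""
    if db_path.is_symlink():
        raise ValueError(f"refusing to wipe symlink: {db_path}")
    if db_path.is_dir():
        shutil.rmtree(db_path)


# Small demonstration inside a throwaway temp directory.
root = Path(tempfile.mkdtemp())
target = root / "real"
target.mkdir()
link = root / "link"
link.symlink_to(target, target_is_directory=True)

try:
    wipe_db_dir(link)
    refused = False
except ValueError:
    refused = True

lock = root / "db.lock"
holder = open(lock, "a")
fcntl.flock(holder, fcntl.LOCK_EX | fcntl.LOCK_NB)
locked_while_held = path_is_locked(lock)
fcntl.flock(holder, fcntl.LOCK_UN)
holder.close()
locked_after_release = path_is_locked(lock)
```

The is_symlink check only inspects the final path component, so a symlinked parent directory would still be followed; a stricter variant would compare the path against its resolved form.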
- git diff HEAD: 3 files, +90/-92 LOC).
- bugfix/nvbug-6222417-20260602-162848 (the skill intentionally does not create the commit per the "Do not create a git commit." directive).
- --no-nvbugs-update flag passed to this run (see §10).
- NGC_API_KEY env-derivation in nvdev.env, missing test-deps in the venv install, and the architectural suggestion to push lite cleanup into the library.
- ingestor-server (and optionally rag-server) on a host with NGC_API_KEY exported when convenient. The fix does not affect production container behavior, so this is a hygiene step rather than a required deploy.
10. NVBugs Audit Trail
--no-nvbugs-update flag passed to this run
Notes:
- Closed / BugAction QA - Closed - Verified / Disposition Bug - Fixed (closed on 2026-05-28 after the fix landed on release-v2.6.0). This run ports that already-validated fix from release-v2.6.0 to develop so future develop-based branches do not regress.
8. Resumption Log
(empty — this run had no escalations / resumptions)
11. Review Iterations
(empty on the first Phase 7 invocation; review iterations append rows here)