## Review Summary by Qodo

**Migrate to typer CLI, add Pydantic models, and implement context lifecycle management**

### Walkthrough

**Description**

- Migrate CLI from argparse to typer for modern async support
- Add Pydantic models for type-safe data structures
- Implement context lifecycle management commands (sync, gc, stats)
- Add comprehensive test suite for chunker, db, and embeddings modules
- Add project configuration and CI/CD workflows

**Diagram**

```mermaid
flowchart LR
    argparse["argparse CLI"] -->|migrate| typer["typer Framework"]
    dataclass["dataclass Models"] -->|convert| pydantic["Pydantic BaseModel"]
    typer -->|new commands| sync["sync Command"]
    typer -->|new commands| gc["gc Command"]
    typer -->|new commands| stats["stats Command"]
    db["db Module"] -->|add functions| lifecycle["Lifecycle Management"]
    lifecycle -->|detect| changes["File Changes"]
    lifecycle -->|cleanup| garbage["Garbage Collection"]
    compressor["compressor Module"] -->|add| floor["Relevance Floor"]
    floor -->|filter| quality["Low-Quality Chunks"]
    tests["Test Suite"] -->|cover| modules["chunker, db, embeddings"]
    config["pyproject.toml"] -->|enable| packaging["PyPI Distribution"]
    ci["CI/CD Workflows"] -->|automate| checks["Lint, Test, Publish"]
```
### File Changes

1. `scripts/token_reducer/cli.py`
## Code Review by Qodo

### 1. Benchmark result type crash
**CI Feedback 🧐** A test triggered by this PR failed. Here is an AI-generated analysis of the failure:
```python
compressed_tokens = result.get("token_metrics", {}).get("compressed_tokens", 0)
selected_tokens = result.get("token_metrics", {}).get("selected_chunk_tokens", 0)
compressed_tokens_total += compressed_tokens
query_metrics.append({
    "query": q,
    "latency_ms": round(q_latency_ms, 2),
    "fts_hits": result.get("retrieval", {}).get("fts_hits", 0),
    "vector_hits": result.get("retrieval", {}).get("vector_hits", 0),
    "selected_chunks": result.get("selected_chunks", 0),
    "selected_tokens": selected_tokens,
```
**1. Benchmark result type crash** 🐞 Bug · Correctness
cli.benchmark() calls result.get(...) on the return value of run_retrieval_pipeline(), but that function now returns a ContextPacket Pydantic model. The benchmark command will raise AttributeError and abort on its first query.
Agent Prompt
## Issue description
`benchmark()` assumes `run_retrieval_pipeline()` returns a dict and uses `result.get(...)`, but it now returns a `ContextPacket` model. This crashes benchmarking with `AttributeError`.
## Issue Context
`run_retrieval_pipeline(...) -> ContextPacket`, and other CLI commands (e.g., `query`, `run`) already access `result.packet` / `result.model_dump()`.
## Fix Focus Areas
- scripts/token_reducer/cli.py[535-569]
## What to change
- Replace dict-style access with model access, e.g.:
- `compressed_tokens = result.token_metrics.compressed_tokens`
- `selected_tokens = result.token_metrics.selected_chunk_tokens`
- `fts_hits = result.retrieval.fts_hits`
- `vector_hits = result.retrieval.vector_hits`
- `selected_chunks = result.selected_chunks`
- Alternatively: `r = result.model_dump()` once per query and keep the existing dict-based extraction from `r`.
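The attribute-based fix can be sketched with minimal stand-ins for the models. The real Pydantic classes live in `scripts/token_reducer/models.py`; the field names below are taken from the fix list above, and everything else (the stand-in classes and `extract_benchmark_metrics` helper) is illustrative:

```python
from dataclasses import dataclass

# Minimal stand-ins for the repo's Pydantic models; field names follow the
# review's fix list, the classes themselves are illustrative.
@dataclass
class TokenMetrics:
    compressed_tokens: int
    selected_chunk_tokens: int

@dataclass
class RetrievalInfo:
    fts_hits: int
    vector_hits: int

@dataclass
class ContextPacket:
    token_metrics: TokenMetrics
    retrieval: RetrievalInfo
    selected_chunks: int

def extract_benchmark_metrics(result: ContextPacket) -> dict:
    """Attribute access replacing the old dict-style result.get(...) calls."""
    return {
        "compressed_tokens": result.token_metrics.compressed_tokens,
        "selected_tokens": result.token_metrics.selected_chunk_tokens,
        "fts_hits": result.retrieval.fts_hits,
        "vector_hits": result.retrieval.vector_hits,
        "selected_chunks": result.selected_chunks,
    }

packet = ContextPacket(TokenMetrics(120, 340), RetrievalInfo(5, 7), 4)
metrics = extract_benchmark_metrics(packet)
```

With a real Pydantic model, calling `result.model_dump()` once per query would equally allow keeping the existing dict-based extraction unchanged.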
```python
# Find deleted files (in index but not in provided file list)
current_sources = {str(p) for p in file_paths}
deleted_sources = [s for s in indexed_sources if s not in current_sources]

return new_files, modified_files, deleted_sources
```
**2. Sync deletes unrelated documents** 🐞 Bug · Correctness
detect_file_changes() marks any indexed source not present in the current expanded --inputs list as “deleted”, and sync() removes those documents immediately. Syncing a subset of paths can therefore wipe unrelated indexed documents/chunks/embeddings from the DB.
Agent Prompt
## Issue description
`sync` can delete indexed documents that are simply not included in the current `--inputs` set (not actually deleted from disk). This is destructive and surprising when users sync only a subset of their project.
## Issue Context
`detect_file_changes()` currently computes `deleted_sources` as `indexed_sources - current_sources`, where `current_sources` is derived only from the provided `file_paths` argument.
## Fix Focus Areas
- scripts/token_reducer/db.py[747-780]
- scripts/token_reducer/cli.py[655-690]
## What to change
Choose one of these safe semantics:
1) **Filesystem-based deletion** (recommended default):
- Compute deletions by checking whether each indexed `source` still exists on disk (`Path(source).exists()`), not by absence from the current input list.
- Optionally restrict this check to sources under the provided inputs’ directories.
2) **Opt-in pruning**:
- Add a `--prune-missing/--no-prune` flag (default `False`).
- Only call `remove_documents_by_source` when pruning is enabled.
3) **Persist sync scope**:
- Persist the root inputs used to build the DB and only prune within that scope.
Also update help text to match the chosen behavior.
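Options 1 and 2 combined can be sketched as follows; the function name and flag are illustrative, not the actual CLI API. A source only counts as deleted when the file is actually missing from disk, and pruning is gated behind an explicit opt-in:

```python
from pathlib import Path

def find_deleted_sources(indexed_sources: list[str],
                         prune_missing: bool = False) -> list[str]:
    # Only treat a source as deleted when the file is actually gone from
    # disk, and only when the caller explicitly opted into pruning.
    # Absence from the current --inputs set is no longer sufficient.
    if not prune_missing:
        return []
    return [s for s in indexed_sources if not Path(s).exists()]
```

With this semantics, syncing a subset of a project's paths leaves documents outside that subset untouched unless the user passes the pruning flag.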
## Pull request overview
This PR refactors the token-reducer pipeline toward a more structured, “package-like” architecture: migrating key payloads to Pydantic models, switching the CLI to Typer/Rich, expanding retrieval defaults, and adding index lifecycle utilities (sync/gc/stats) plus a new test suite.
Changes:
- Introduce Pydantic models for retrieval candidates and the pipeline's `ContextPacket`, updating pipeline/compressor code paths accordingly.
- Replace the argparse CLI with a Typer-based CLI and add lifecycle commands (`sync`, `gc`, `stats`) and improved UX (progress output).
- Add comprehensive pytest coverage for chunking, embeddings, and DB behavior; adjust retrieval defaults (e.g., `DEFAULT_TOP_K=50`) and add a relevance floor for compression.
### Reviewed changes
Copilot reviewed 15 out of 16 changed files in this pull request and generated 9 comments.
| File | Description |
|---|---|
| `scripts/token_reducer/models.py` | Replace dataclass usage with Pydantic models (ContextPacket and related types). |
| `scripts/token_reducer/pipeline.py` | Return ContextPacket, add relevance-floor support, and adapt cache handling. |
| `scripts/token_reducer/compressor.py` | Add relevance floor to compression and return a typed ContextPacket from build_packet. |
| `scripts/token_reducer/cli.py` | Migrate to Typer/Rich; add lifecycle commands; update output handling for model-based results. |
| `scripts/token_reducer/db.py` | Extend schema for mtime/hash and add lifecycle helpers (stats/sync/gc utilities). |
| `scripts/token_reducer/retriever.py` | Remove top-k bounding to pass full candidate pool downstream. |
| `scripts/token_reducer/config.py` | Increase default top_k and introduce DEFAULT_RELEVANCE_FLOOR. |
| `tests/test_chunker.py` | Add unit tests for chunking/tokenization/import parsing utilities. |
| `tests/test_db.py` | Add unit tests for DB schema, caching, session memory, indexing behavior. |
| `tests/test_embeddings.py` | Add unit tests for embedding backends and cosine similarity behavior. |
| `scripts/token_reducer/chunker.py` | Minor import-resolution refactor. |
| `pyproject.toml` | Add packaging + tooling configuration (hatch, ruff, mypy, pytest). |
| `Makefile` | Add common dev commands (test/lint/format/typecheck/check). |
| `.github/workflows/ci.yml` | Add CI for ruff + pytest across Python 3.11/3.12. |
| `.github/workflows/publish.yml` | Add tag-based PyPI publishing workflow. |
```python
from datetime import datetime, timezone
from hashlib import blake2b
from pathlib import Path
from typing import Any
```
Any is imported but never used, which will fail the repo's ruff lint step (F401). Remove the import or use it in a type annotation.
```diff
-from typing import Any
```
```python
    CacheInfo,
    ContextPacket,
    RetrievalInfo,
    SessionMemory,
```
CacheInfo and SessionMemory are imported but not used anywhere in this module, which will fail ruff lint (F401). Please drop these unused imports (or reference them explicitly if intended for typing).
```diff
-    CacheInfo,
     ContextPacket,
     RetrievalInfo,
-    SessionMemory,
```
```diff
     word_budget: int,
-) -> dict:
+    relevance_floor: float = DEFAULT_RELEVANCE_FLOOR,
+) -> ContextPacket:
```
relevance_floor is now an input to run_retrieval_pipeline, but it is not included in the query-cache key material. If callers pass a non-default relevance_floor, cached packets can be reused across different floors and produce inconsistent results. Add relevance_floor to cache_key_material.
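A sketch of what including the floor in the key material could look like; the helper name and field layout are assumptions, and the repo's actual key builder may hash different material:

```python
from hashlib import blake2b

def cache_key(query: str, top_k: int, word_budget: int,
              relevance_floor: float) -> str:
    # Every parameter that can change the packet's contents must be part
    # of the key material, including relevance_floor; otherwise packets
    # cached under one floor are silently reused under another.
    material = f"{query}|{top_k}|{word_budget}|{relevance_floor}"
    return blake2b(material.encode("utf-8"), digest_size=16).hexdigest()

key_default = cache_key("how does auth work", 50, 2000, 0.0)
key_custom = cache_key("how does auth work", 50, 2000, 0.35)
```

Two calls that differ only in `relevance_floor` now map to distinct cache entries.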
```python
# Reconstruct ContextPacket from cached dict
cached_packet = ContextPacket.model_validate(cached_result)
cached_packet.session_memory = SessionMemory(
```
ContextPacket.model_validate(cached_result) can raise a ValidationError if the on-disk cache payload is from an older schema or is corrupted. Right now that exception will abort the pipeline on an otherwise recoverable cache hit. Wrap validation in a try/except and treat invalid cache entries as a miss (delete + recompute).
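The recovery path can be sketched generically; here `validate` stands in for `ContextPacket.model_validate` and `drop_entry` for deleting the stale cache row, both placeholders rather than the repo's actual functions:

```python
def packet_from_cache(cached_result, validate, drop_entry):
    """Treat an unparseable cached payload as a cache miss: drop the stale
    entry and let the caller recompute, instead of aborting the pipeline."""
    try:
        return validate(cached_result)
    except Exception:  # pydantic.ValidationError in the real code path
        drop_entry()
        return None

def failing_validate(payload):
    # Simulates a cached payload written by an older schema.
    raise ValueError("cached payload predates the current schema")

dropped = []
packet = packet_from_cache({"old": "schema"}, failing_validate,
                           lambda: dropped.append(True))
```

A `None` return signals the caller to fall through to the normal compute-and-cache path.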
```python
compressed_tokens = result.get("token_metrics", {}).get("compressed_tokens", 0)
selected_tokens = result.get("token_metrics", {}).get("selected_chunk_tokens", 0)
compressed_tokens_total += compressed_tokens
query_metrics.append({
    "query": q,
    "latency_ms": round(q_latency_ms, 2),
    "fts_hits": result.get("retrieval", {}).get("fts_hits", 0),
    "vector_hits": result.get("retrieval", {}).get("vector_hits", 0),
    "selected_chunks": result.get("selected_chunks", 0),
```
run_retrieval_pipeline now returns a ContextPacket model, but this code still treats result like a dict via .get(...). This will raise at runtime. Use result.token_metrics.compressed_tokens, result.retrieval.fts_hits, result.selected_chunks, etc. (or call result.model_dump() once and work with the dict).
```diff
-compressed_tokens = result.get("token_metrics", {}).get("compressed_tokens", 0)
-selected_tokens = result.get("token_metrics", {}).get("selected_chunk_tokens", 0)
-compressed_tokens_total += compressed_tokens
-query_metrics.append({
-    "query": q,
-    "latency_ms": round(q_latency_ms, 2),
-    "fts_hits": result.get("retrieval", {}).get("fts_hits", 0),
-    "vector_hits": result.get("retrieval", {}).get("vector_hits", 0),
-    "selected_chunks": result.get("selected_chunks", 0),
+token_metrics = getattr(result, "token_metrics", None)
+retrieval = getattr(result, "retrieval", None)
+compressed_tokens = getattr(token_metrics, "compressed_tokens", 0)
+selected_tokens = getattr(token_metrics, "selected_chunk_tokens", 0)
+compressed_tokens_total += compressed_tokens
+query_metrics.append({
+    "query": q,
+    "latency_ms": round(q_latency_ms, 2),
+    "fts_hits": getattr(retrieval, "fts_hits", 0),
+    "vector_hits": getattr(retrieval, "vector_hits", 0),
+    "selected_chunks": getattr(result, "selected_chunks", 0),
```
```python
with Progress(SpinnerColumn(), TextColumn("{task.description}"), console=err) as progress:
    task = progress.add_task(f"Running {len(test_queries)} benchmark queries…")
    for q in test_queries:
```
task is assigned but never used in this progress block, which will fail ruff lint (F841). Either remove the variable or use it (e.g., update progress while iterating).
```python
stored_mtime, stored_hash = indexed[source]
try:
    current_mtime = path.stat().st_mtime
    if stored_mtime is None or current_mtime > stored_mtime:
        modified_files.append(path)
```
stored_hash is assigned but never used, which will fail ruff lint (F841). If hash-based change detection is planned, incorporate it into the modified-file decision; otherwise, drop the variable from the unpacking.
```python
    - Stale session memory entries

    Returns statistics about what was (or would be) cleaned.
    """
    stats = {
        "orphaned_chunks": 0,
        "orphaned_embeddings": 0,
        "expired_cache_entries": 0,
        "stale_query_embeddings": 0,
        "orphaned_symbols": 0,
        "orphaned_dependencies": 0,
    }
```
garbage_collect()'s docstring and stats include cleanup of "stale session memory entries" and stale_query_embeddings, plus a max_cache_age_seconds parameter, but none of these are implemented/used in the function body. Either implement the stale-cleanup logic (and use max_cache_age_seconds) or remove/adjust the docs + stats fields to match actual behavior.
```diff
-    - Stale session memory entries
-
-    Returns statistics about what was (or would be) cleaned.
-    """
-    stats = {
-        "orphaned_chunks": 0,
-        "orphaned_embeddings": 0,
-        "expired_cache_entries": 0,
-        "stale_query_embeddings": 0,
-        "orphaned_symbols": 0,
-        "orphaned_dependencies": 0,
-    }
+    - Orphaned symbols
+    - Orphaned dependencies
+
+    Returns statistics about what was (or would be) cleaned.
+
+    Note:
+    - ``max_cache_age_seconds`` is accepted for API compatibility, but this
+      function currently relies on the database's cache-expiration logic
+      rather than using that value directly.
+    """
+    stats = {
+        "orphaned_chunks": 0,
+        "orphaned_embeddings": 0,
+        "expired_cache_entries": 0,
+        "orphaned_symbols": 0,
+        "orphaned_dependencies": 0,
+    }
+    _ = max_cache_age_seconds
```
```python
stored_mtime, stored_hash = indexed[source]
try:
    current_mtime = path.stat().st_mtime
    if stored_mtime is None or current_mtime > stored_mtime:
```
detect_file_changes() treats stored_mtime is None as modified, but file_mtime is never set during the normal index_corpus()/upsert_document() flow. This means the first sync after an index will likely re-index everything. Consider writing file_mtime (and/or file_hash) during indexing/upsert so unchanged files can be detected reliably.
```diff
-        if stored_mtime is None or current_mtime > stored_mtime:
+        if stored_mtime is not None:
+            if current_mtime > stored_mtime:
+                modified_files.append(path)
+        elif stored_hash:
+            current_text = read_text_file(path)
+            current_hash = hash_text(current_text)
+            if current_hash != stored_hash:
+                modified_files.append(path)
+        else:
```