
Architectural change #1

Merged
Madhan230205 merged 2 commits into main from test on Apr 3, 2026

Conversation

@Madhan230205 (Owner)

No description provided.

Copilot AI review requested due to automatic review settings April 3, 2026 01:23
@qodo-code-review

Review Summary by Qodo

Migrate to typer CLI, add Pydantic models, and implement context lifecycle management

✨ Enhancement 🧪 Tests


Walkthroughs

Description
• Migrate CLI from argparse to typer for modern async support
• Add Pydantic models for type-safe data structures
• Implement context lifecycle management commands (sync, gc, stats)
• Add comprehensive test suite for chunker, db, and embeddings modules
• Add project configuration and CI/CD workflows
Diagram
flowchart LR
  argparse["argparse CLI"] -->|migrate| typer["typer Framework"]
  dataclass["dataclass Models"] -->|convert| pydantic["Pydantic BaseModel"]
  typer -->|new commands| sync["sync Command"]
  typer -->|new commands| gc["gc Command"]
  typer -->|new commands| stats["stats Command"]
  db["db Module"] -->|add functions| lifecycle["Lifecycle Management"]
  lifecycle -->|detect| changes["File Changes"]
  lifecycle -->|cleanup| garbage["Garbage Collection"]
  compressor["compressor Module"] -->|add| floor["Relevance Floor"]
  floor -->|filter| quality["Low-Quality Chunks"]
  tests["Test Suite"] -->|cover| modules["chunker, db, embeddings"]
  config["pyproject.toml"] -->|enable| packaging["PyPI Distribution"]
  ci["CI/CD Workflows"] -->|automate| checks["Lint, Test, Publish"]


File Changes

1. scripts/token_reducer/cli.py ✨ Enhancement +645/-425

Replace argparse with typer CLI framework

scripts/token_reducer/cli.py


2. scripts/token_reducer/models.py ✨ Enhancement +161/-3

Convert dataclasses to Pydantic BaseModel

scripts/token_reducer/models.py


3. scripts/token_reducer/compressor.py ✨ Enhancement +78/-46

Add relevance floor and Pydantic model returns

scripts/token_reducer/compressor.py


4. scripts/token_reducer/pipeline.py ✨ Enhancement +33/-22

Update pipeline to use Pydantic ContextPacket

scripts/token_reducer/pipeline.py


5. scripts/token_reducer/db.py ✨ Enhancement +272/-1

Add lifecycle management and sync functions

scripts/token_reducer/db.py


6. scripts/token_reducer/config.py ✨ Enhancement +2/-1

Add relevance floor and expand top-k pool

scripts/token_reducer/config.py


7. scripts/token_reducer/chunker.py ✨ Enhancement +3/-2

Refactor import resolution path handling

scripts/token_reducer/chunker.py


8. scripts/token_reducer/retriever.py ✨ Enhancement +3/-2

Remove bounded top-k restriction for filtering

scripts/token_reducer/retriever.py


9. tests/test_chunker.py 🧪 Tests +359/-0

Comprehensive unit tests for chunker module

tests/test_chunker.py


10. tests/test_db.py 🧪 Tests +342/-0

Comprehensive unit tests for database module

tests/test_db.py


11. tests/test_embeddings.py 🧪 Tests +146/-0

Comprehensive unit tests for embeddings module

tests/test_embeddings.py


12. pyproject.toml ⚙️ Configuration changes +85/-0

Add project metadata and build configuration

pyproject.toml


13. .github/workflows/ci.yml ⚙️ Configuration changes +36/-0

Add GitHub Actions CI pipeline

.github/workflows/ci.yml


14. .github/workflows/publish.yml ⚙️ Configuration changes +39/-0

Add GitHub Actions PyPI publish workflow

.github/workflows/publish.yml


15. Makefile ⚙️ Configuration changes +33/-0

Add development task automation

Makefile


16. tests/__init__.py Additional files +0/-0

...

tests/__init__.py



@qodo-code-review

qodo-code-review bot commented Apr 3, 2026

Code Review by Qodo

🐞 Bugs (4) 📘 Rule violations (0) 📎 Requirement gaps (0) 🎨 UX Issues (0)



Action required

1. Benchmark result type crash 🐞 Bug ≡ Correctness
Description
cli.benchmark() calls result.get(...) on the return value of run_retrieval_pipeline(), but that
function now returns a ContextPacket Pydantic model. The benchmark command will raise AttributeError
and abort on its first query.
Code

scripts/token_reducer/cli.py[R557-566]

+            compressed_tokens = result.get("token_metrics", {}).get("compressed_tokens", 0)
+            selected_tokens = result.get("token_metrics", {}).get("selected_chunk_tokens", 0)
+            compressed_tokens_total += compressed_tokens
+            query_metrics.append({
+                "query": q,
+                "latency_ms": round(q_latency_ms, 2),
+                "fts_hits": result.get("retrieval", {}).get("fts_hits", 0),
+                "vector_hits": result.get("retrieval", {}).get("vector_hits", 0),
+                "selected_chunks": result.get("selected_chunks", 0),
+                "selected_tokens": selected_tokens,
Evidence
benchmark() treats the pipeline result as a dict via .get(), but run_retrieval_pipeline is defined
to return a ContextPacket model, which does not implement dict-style .get().

scripts/token_reducer/cli.py[535-569]
scripts/token_reducer/pipeline.py[44-62]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`benchmark()` assumes `run_retrieval_pipeline()` returns a dict and uses `result.get(...)`, but it now returns a `ContextPacket` model. This crashes benchmarking with `AttributeError`.
## Issue Context
`run_retrieval_pipeline(...)-> ContextPacket` and other CLI commands (e.g., `query`, `run`) already access `result.packet` / `result.model_dump()`.
## Fix Focus Areas
- scripts/token_reducer/cli.py[535-569]
## What to change
- Replace dict-style access with model access, e.g.:
- `compressed_tokens = result.token_metrics.compressed_tokens`
- `selected_tokens = result.token_metrics.selected_chunk_tokens`
- `fts_hits = result.retrieval.fts_hits`
- `vector_hits = result.retrieval.vector_hits`
- `selected_chunks = result.selected_chunks`
- Alternatively: `r = result.model_dump()` once per query and keep the existing dict-based extraction from `r`.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools
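The `model_dump()` route suggested above can be sketched with stand-in dataclasses — `asdict()` here plays the role of the real `ContextPacket.model_dump()`, since the actual Pydantic models live in this PR and aren't reproduced here:

```python
from dataclasses import dataclass, asdict

# Minimal stand-ins for the PR's ContextPacket / TokenMetrics / RetrievalInfo
# (field names taken from the review; the real classes are Pydantic models).
@dataclass
class TokenMetrics:
    compressed_tokens: int = 0
    selected_chunk_tokens: int = 0

@dataclass
class RetrievalInfo:
    fts_hits: int = 0
    vector_hits: int = 0

@dataclass
class ContextPacket:
    token_metrics: TokenMetrics
    retrieval: RetrievalInfo
    selected_chunks: int = 0

def extract_metrics(result: ContextPacket) -> dict:
    # Dump the model once per query; the old dict-style .get() extraction
    # then keeps working unchanged (asdict() ~ model_dump()).
    r = asdict(result)
    return {
        "fts_hits": r.get("retrieval", {}).get("fts_hits", 0),
        "vector_hits": r.get("retrieval", {}).get("vector_hits", 0),
        "selected_chunks": r.get("selected_chunks", 0),
        "selected_tokens": r.get("token_metrics", {}).get("selected_chunk_tokens", 0),
    }
```

This keeps the benchmark's existing extraction logic while fixing the `AttributeError`.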


2. Sync deletes unrelated documents 🐞 Bug ≡ Correctness
Description
detect_file_changes() marks any indexed source not present in the current expanded --inputs list as
“deleted”, and sync() removes those documents immediately. Syncing a subset of paths can therefore
wipe unrelated indexed documents/chunks/embeddings from the DB.
Code

scripts/token_reducer/db.py[R776-780]

+    # Find deleted files (in index but not in provided file list)
+    current_sources = {str(p) for p in file_paths}
+    deleted_sources = [s for s in indexed_sources if s not in current_sources]
+
+    return new_files, modified_files, deleted_sources
Evidence
Deletion detection is implemented as a set difference between all indexed sources and the current
file_paths argument, not as a filesystem existence check. sync() passes only the current
collect_input_files(inputs) expansion and then calls remove_documents_by_source() on the
computed deletions.

scripts/token_reducer/db.py[747-780]
scripts/token_reducer/cli.py[655-689]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`sync` can delete indexed documents that are simply not included in the current `--inputs` set (not actually deleted from disk). This is destructive and surprising when users sync only a subset of their project.
## Issue Context
`detect_file_changes()` currently computes `deleted_sources` as `indexed_sources - current_sources`, where `current_sources` is derived only from the provided `file_paths` argument.
## Fix Focus Areas
- scripts/token_reducer/db.py[747-780]
- scripts/token_reducer/cli.py[655-690]
## What to change
Choose one of these safe semantics:
1) **Filesystem-based deletion** (recommended default):
- Compute deletions by checking whether each indexed `source` still exists on disk (`Path(source).exists()`), not by absence from the current input list.
- Optionally restrict this check to sources under the provided inputs’ directories.
2) **Opt-in pruning**:
- Add a `--prune-missing/--no-prune` flag (default `False`).
- Only call `remove_documents_by_source` when pruning is enabled.
3) **Persist sync scope**:
- Persist the root inputs used to build the DB and only prune within that scope.
Also update help text to match the chosen behavior.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools
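Options 1 and 2 above combine naturally into a small sketch (names are illustrative, not the project's actual API): a source only counts as deleted when it is actually gone from disk, and nothing is pruned unless the caller opts in.

```python
from pathlib import Path

def detect_deleted_sources(indexed_sources, prune_missing: bool = False) -> list:
    """Return indexed sources that no longer exist on disk.

    Absence from the current --inputs expansion is NOT treated as deletion;
    with prune_missing=False (the default) nothing is ever pruned.
    """
    if not prune_missing:
        return []
    return [s for s in indexed_sources if not Path(s).exists()]
```

Syncing a subset of the project then reindexes only that subset, without silently dropping documents indexed from other paths.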



Remediation recommended

3. file_mtime not set on index 🐞 Bug ☼ Reliability
Description
Sync change detection relies on documents.file_mtime, but normal indexing
(upsert_document/index_corpus) never populates file_mtime, so stored_mtime is NULL. This causes the
first sync after an index to treat all previously indexed files as modified and reindex them,
reducing incremental-sync value.
Code

scripts/token_reducer/db.py[R767-772]

+            stored_mtime, stored_hash = indexed[source]
+            try:
+                current_mtime = path.stat().st_mtime
+                if stored_mtime is None or current_mtime > stored_mtime:
+                    modified_files.append(path)
+            except OSError:
Evidence
detect_file_changes() flags a file as modified when stored_mtime is None. upsert_document inserts
documents without setting file_mtime, and index_corpus doesn’t update it either, so freshly indexed
DBs will have NULL mtimes until sync explicitly updates them.

scripts/token_reducer/db.py[762-772]
scripts/token_reducer/db.py[366-373]
scripts/token_reducer/cli.py[716-720]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`documents.file_mtime` is used for incremental sync, but it is not written during `index`/`run` indexing paths. This makes the first `sync` after indexing do a full reindex.
## Issue Context
- `detect_file_changes`: `if stored_mtime is None or current_mtime > stored_mtime: modified_files.append(path)`
- `upsert_document` inserts into `documents` without `file_mtime`
- `sync` does call `update_document_mtime`, but only for files reindexed in sync.
## Fix Focus Areas
- scripts/token_reducer/db.py[335-454]
- scripts/token_reducer/db.py[945-961]
## What to change
- Option A (simple): in `index_corpus`, after each `upsert_document(...)`, call `update_document_mtime(conn, source=str(path), mtime=path.stat().st_mtime)`.
- Option B: extend `upsert_document(..., file_mtime: float | None = None, file_hash: str | None = None)` and set these columns on INSERT/UPDATE.
Ensure the value is set for both insert and update cases.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools
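Option A can be sketched as follows, with a plain dict standing in for the `documents` table (the real `upsert_document`/`update_document_mtime` helpers are assumed from the review, not reproduced): the point is simply that `file_mtime` is written at index time, so `stored_mtime` is never NULL on the first sync.

```python
from pathlib import Path

def index_corpus(store: dict, paths) -> None:
    """Index files, recording file_mtime on insert AND update."""
    for path in paths:
        p = Path(path)
        store[str(p)] = {
            "text": p.read_text(encoding="utf-8"),
            "file_mtime": p.stat().st_mtime,  # set here, not only during sync
        }

def needs_reindex(store: dict, path) -> bool:
    """Mirror of detect_file_changes' modified check."""
    row = store.get(str(path))
    if row is None or row.get("file_mtime") is None:
        return True  # unknown file or NULL mtime forces a reindex
    return Path(path).stat().st_mtime > row["file_mtime"]
```

With the mtime recorded at index time, an immediately following sync finds nothing modified instead of reindexing the whole corpus.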



Advisory comments

4. Unpinned GitHub actions 🐞 Bug ⛨ Security
Description
Workflows use mutable action tags (e.g., actions/checkout@v4,
pypa/gh-action-pypi-publish@release/v1) rather than immutable commit SHAs. This increases CI/CD
supply-chain risk, especially for the PyPI publish workflow.
Code

.github/workflows/publish.yml[R13-15]

+      - uses: actions/checkout@v4
+      - uses: actions/setup-python@v5
+        with:
Evidence
Both CI and publish workflows reference third-party actions by moving tags, which can be retargeted
upstream.

.github/workflows/ci.yml[13-16]
.github/workflows/publish.yml[13-16]
.github/workflows/publish.yml[39-39]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
GitHub Actions are referenced by mutable tags; pinning to commit SHAs reduces the risk of supply-chain attacks via upstream tag retargeting.
## Issue Context
This is most important for the publish workflow that can release artifacts.
## Fix Focus Areas
- .github/workflows/ci.yml[13-31]
- .github/workflows/publish.yml[13-39]
## What to change
- Replace `uses: owner/action@vX` with `uses: owner/action@<full_commit_sha>` (optionally keep a comment with the human version).
- Do the same for `pypa/gh-action-pypi-publish@release/v1` and artifact actions.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools
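A pinned version of the publish steps would look like the sketch below. The 40-zero SHA is a placeholder, not a real pin — resolve the actual commit for the tag you audit (e.g. via the action's releases page) before committing:

```yaml
steps:
  # Pin to a full commit SHA; keep the human-readable tag as a comment.
  - uses: actions/checkout@0000000000000000000000000000000000000000  # v4
  - uses: actions/setup-python@0000000000000000000000000000000000000000  # v5
```

Dependabot can keep SHA pins current while preserving the version comments.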


Grey Divider



@qodo-code-review

CI Feedback 🧐

A test triggered by this PR failed. Here is an AI-generated analysis of the failure:

Action: Lint (ruff)

Failed stage: Run ruff check scripts/ tests/ [❌]

Failed test name: ""

Failure summary:

The action failed because the linting step (Ruff) reported code-quality violations and exited with status 1. Ruff found 49 errors across multiple files (e.g., scripts/apply_diff.py, scripts/token_reducer/chunker.py, scripts/token_reducer/db.py, tests/test_embeddings.py), including:

• Unused variables (e.g., B007 unused loop variable `b` at scripts/apply_diff.py:413:17 and :421:17; F841 unused `applied_before` at scripts/apply_diff.py:462:13).
• Unused imports (e.g., F401 `tree_sitter` unused at scripts/token_reducer/chunker.py:406:16; `.models.embedding_cache_key` unused at scripts/token_reducer/db.py:15:5).
• Import sorting/formatting issues (I001 in several files, including scripts/token_reducer/chunker.py:423:9, scripts/token_reducer/cli.py:17:1, and tests/test_embeddings.py:1:1).
• A style recommendation (SIM105) to replace try/except/pass with contextlib.suppress at scripts/token_reducer/db.py:146:13.

The job ended with "Process completed with exit code 1" because these Ruff errors were not fixed (37 of them are auto-fixable with `--fix`).
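The SIM105 item is the only non-mechanical fix; it looks like the sketch below, mirroring db.py's idempotent ALTER TABLE migration loop (here against an in-memory sqlite3 database, so it is self-contained):

```python
import contextlib
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE documents (id INTEGER)")

# contextlib.suppress replaces try/except/pass: re-running an ALTER that
# already applied raises sqlite3.OperationalError, which is simply ignored.
for statement in (
    "ALTER TABLE documents ADD COLUMN file_mtime REAL",
    "ALTER TABLE documents ADD COLUMN file_mtime REAL",  # duplicate: suppressed
):
    with contextlib.suppress(sqlite3.OperationalError):
        conn.execute(statement)
```

The remaining errors (unused imports/variables, import sorting) are handled by `ruff check --fix`.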

Relevant error logs:
1:  ##[group]Runner Image Provisioner
2:  Hosted Compute Agent
...

174:  219 | ) -> tuple[bool, str, str]:
175:  220 |     if block.search not in content:
176:  221 |         return False, f"Search text not found", content
177:  |                       ^^^^^^^^^^^^^^^^^^^^^^^^
178:  222 |
179:  223 |     occurrences = content.count(block.search)
180:  |
181:  help: Remove extraneous `f` prefix
182:  B007 Loop control variable `b` not used within loop body
183:  --> scripts/apply_diff.py:413:17
184:  |
185:  411 |     for target, file_blocks in file_groups.items():
186:  412 |         if not target.exists():
187:  413 |             for b in file_blocks:
188:  |                 ^
189:  414 |                 results["failed"] += 1
190:  415 |                 results["messages"].append(f"File not found: {target}")
191:  |
192:  help: Rename unused `b` to `_b`
193:  B007 Loop control variable `b` not used within loop body
194:  --> scripts/apply_diff.py:421:17
195:  |
196:  419 |             original = target.read_text(encoding="utf-8")
197:  420 |         except Exception as exc:
198:  421 |             for b in file_blocks:
199:  |                 ^
200:  422 |                 results["failed"] += 1
201:  423 |             results["messages"].append(f"Failed to read {target}: {exc}")
202:  |
203:  help: Rename unused `b` to `_b`
204:  F841 Local variable `applied_before` is assigned to but never used
205:  --> scripts/apply_diff.py:462:13
206:  |
207:  460 | …     else:
208:  461 | …         # Entire file transaction failed — original untouched on disk
209:  462 | …         applied_before = sum(1 for m in tx_messages if "rolled back" not in m and "not found" not in m and not m.endswith(tx_messag…
210:  |           ^^^^^^^^^^^^^^
211:  463 | …         results["failed"] += len(file_blocks)
212:  464 | …         results["messages"].extend(tx_messages)
...

314:  6 | from pathlib import Path
315:  7 | from typing import Sequence
316:  | ^^^^^^^^^^^^^^^^^^^^^^^^^^^
317:  8 |
318:  9 | from .config import (
319:  |
320:  help: Import from `collections.abc`
321:  F401 `tree_sitter` imported but unused; consider using `importlib.util.find_spec` to test for availability
322:  --> scripts/token_reducer/chunker.py:406:16
323:  |
324:  404 |         return _TREE_SITTER_AVAILABLE
325:  405 |     try:
326:  406 |         import tree_sitter  # type: ignore
327:  |                ^^^^^^^^^^^
328:  407 |         _TREE_SITTER_AVAILABLE = True
329:  408 |     except ImportError:
330:  |
331:  help: Remove unused import: `tree_sitter`
332:  I001 [*] Import block is un-sorted or un-formatted
333:  --> scripts/token_reducer/chunker.py:423:9
334:  |
335:  422 |       try:
336:  423 | /         import tree_sitter  # type: ignore
337:  424 | |         import importlib
338:  | |________________________^
339:  425 |           grammar = importlib.import_module(grammar_module)
340:  |
341:  help: Organize imports
342:  I001 [*] Import block is un-sorted or un-formatted
343:  --> scripts/token_reducer/cli.py:17:1
344:  |
345:  15 |       sys.stdout.reconfigure(encoding="utf-8", errors="replace")
346:  16 |
...

611:  9 |
612:  10 | from .models import (
613:  |
614:  help: Import from `collections.abc`
615:  F401 [*] `.models.embedding_cache_key` imported but unused
616:  --> scripts/token_reducer/db.py:15:5
617:  |
618:  13 |     utc_now_epoch,
619:  14 |     hash_text,
620:  15 |     embedding_cache_key,
621:  |     ^^^^^^^^^^^^^^^^^^^
622:  16 | )
623:  17 | from .chunker import (
624:  |
625:  help: Remove unused import: `.models.embedding_cache_key`
626:  SIM105 Use `contextlib.suppress(sqlite3.OperationalError)` instead of `try`-`except`-`pass`
627:  --> scripts/token_reducer/db.py:146:13
628:  |
629:  144 |               "ALTER TABLE documents ADD COLUMN file_hash TEXT",
630:  145 |           ):
631:  146 | /             try:
632:  147 | |                 conn.execute(statement)
633:  148 | |             except sqlite3.OperationalError:
634:  149 | |                 pass
635:  | |____________________^
636:  150 |
637:  151 |           conn.execute(
638:  |
639:  help: Replace `try`-`except`-`pass` with `with contextlib.suppress(sqlite3.OperationalError): ...`
640:  I001 [*] Import block is un-sorted or un-formatted
...

865:  --> tests/test_embeddings.py:1:1
866:  |
867:  1 | / from __future__ import annotations
868:  2 | |
869:  3 | | import math
870:  4 | |
871:  5 | | from token_reducer.embeddings import (
872:  6 | |     cosine_similarity,
873:  7 | |     embed_text,
874:  8 | |     embed_text_hash,
875:  9 | |     resolve_embedding_backend,
876:  10 | | )
877:  | |_^
878:  |
879:  help: Organize imports
880:  Found 49 errors.
881:  [*] 37 fixable with the `--fix` option (10 hidden fixes can be enabled with the `--unsafe-fixes` option).
882:  ##[error]Process completed with exit code 1.
883:  Post job cleanup.


Copilot AI left a comment


Pull request overview

This PR refactors the token-reducer pipeline toward a more structured, “package-like” architecture: migrating key payloads to Pydantic models, switching the CLI to Typer/Rich, expanding retrieval defaults, and adding index lifecycle utilities (sync/gc/stats) plus a new test suite.

Changes:

  • Introduce Pydantic models for retrieval candidates and the pipeline’s ContextPacket, updating pipeline/compressor code paths accordingly.
  • Replace the argparse CLI with a Typer-based CLI and add lifecycle commands (sync, gc, stats) and improved UX (progress output).
  • Add comprehensive pytest coverage for chunking, embeddings, and DB behavior; adjust retrieval defaults (e.g., DEFAULT_TOP_K=50) and add a relevance floor for compression.

Reviewed changes

Copilot reviewed 15 out of 16 changed files in this pull request and generated 9 comments.

File Description
scripts/token_reducer/models.py Replace dataclass usage with Pydantic models (ContextPacket and related types).
scripts/token_reducer/pipeline.py Return ContextPacket, add relevance-floor support, and adapt cache handling.
scripts/token_reducer/compressor.py Add relevance floor to compression and return a typed ContextPacket from build_packet.
scripts/token_reducer/cli.py Migrate to Typer/Rich; add lifecycle commands; update output handling for model-based results.
scripts/token_reducer/db.py Extend schema for mtime/hash and add lifecycle helpers (stats/sync/gc utilities).
scripts/token_reducer/retriever.py Remove top-k bounding to pass full candidate pool downstream.
scripts/token_reducer/config.py Increase default top_k and introduce DEFAULT_RELEVANCE_FLOOR.
scripts/token_reducer/chunker.py Minor import-resolution refactor.
tests/test_chunker.py Add unit tests for chunking/tokenization/import parsing utilities.
tests/test_db.py Add unit tests for DB schema, caching, session memory, indexing behavior.
tests/test_embeddings.py Add unit tests for embedding backends and cosine similarity behavior.
pyproject.toml Add packaging + tooling configuration (hatch, ruff, mypy, pytest).
Makefile Add common dev commands (test/lint/format/typecheck/check).
.github/workflows/ci.yml Add CI for ruff + pytest across Python 3.11/3.12.
.github/workflows/publish.yml Add tag-based PyPI publishing workflow.


Comment thread scripts/token_reducer/models.py Outdated
from datetime import datetime, timezone
from hashlib import blake2b
from pathlib import Path
from typing import Any

Copilot AI Apr 3, 2026

Any is imported but never used, which will fail the repo's ruff lint step (F401). Remove the import or use it in a type annotation.

Suggested change — remove:
    from typing import Any

Copilot uses AI. Check for mistakes.
Comment thread scripts/token_reducer/compressor.py Outdated
Comment on lines +10 to +13
CacheInfo,
ContextPacket,
RetrievalInfo,
SessionMemory,

Copilot AI Apr 3, 2026

CacheInfo and SessionMemory are imported but not used anywhere in this module, which will fail ruff lint (F401). Please drop these unused imports (or reference them explicitly if intended for typing).

Suggested change — replace:
    CacheInfo,
    ContextPacket,
    RetrievalInfo,
    SessionMemory,
with:
    ContextPacket,
    RetrievalInfo,

Comment on lines 59 to +61
word_budget: int,
) -> dict:
relevance_floor: float = DEFAULT_RELEVANCE_FLOOR,
) -> ContextPacket:

Copilot AI Apr 3, 2026

relevance_floor is now an input to run_retrieval_pipeline, but it is not included in the query-cache key material. If callers pass a non-default relevance_floor, cached packets can be reused across different floors and produce inconsistent results. Add relevance_floor to cache_key_material.

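The fix suggested above can be sketched as follows. The field set is an assumption, not the project's actual `cache_key_material` — the point is only that every parameter that changes the resulting packet, including `relevance_floor`, must feed the key:

```python
import hashlib
import json

def cache_key(query: str, top_k: int, relevance_floor: float) -> str:
    """Derive a stable cache key from all result-affecting parameters."""
    material = json.dumps(
        {"query": query, "top_k": top_k, "relevance_floor": relevance_floor},
        sort_keys=True,  # stable serialization -> stable key
    )
    return hashlib.blake2b(material.encode("utf-8"), digest_size=16).hexdigest()
```

Two calls that differ only in `relevance_floor` now produce distinct keys, so a packet cached at floor 0.2 is never served for floor 0.5.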
Comment on lines +96 to +98
# Reconstruct ContextPacket from cached dict
cached_packet = ContextPacket.model_validate(cached_result)
cached_packet.session_memory = SessionMemory(

Copilot AI Apr 3, 2026

ContextPacket.model_validate(cached_result) can raise a ValidationError if the on-disk cache payload is from an older schema or is corrupted. Right now that exception will abort the pipeline on an otherwise recoverable cache hit. Wrap validation in a try/except and treat invalid cache entries as a miss (delete + recompute).

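The suggested guard reduces to a small, dependency-free sketch: `validate` stands in for `ContextPacket.model_validate` and `evict` for deleting the stale cache row (both names are illustrative), and any validation failure is treated as a cache miss rather than a pipeline crash.

```python
def load_cached_packet(raw, validate, evict):
    """Validate a cached payload; treat schema drift or corruption as a miss.

    Returns the validated packet, or None after evicting the bad entry.
    """
    try:
        return validate(raw)
    except Exception:  # pydantic.ValidationError in the real code
        evict()        # delete the stale row so the miss doesn't repeat
        return None
```

On `None`, the caller falls through to the normal recompute-and-store path.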
Comment thread scripts/token_reducer/cli.py Outdated
Comment on lines +557 to +565
compressed_tokens = result.get("token_metrics", {}).get("compressed_tokens", 0)
selected_tokens = result.get("token_metrics", {}).get("selected_chunk_tokens", 0)
compressed_tokens_total += compressed_tokens
query_metrics.append({
"query": q,
"latency_ms": round(q_latency_ms, 2),
"fts_hits": result.get("retrieval", {}).get("fts_hits", 0),
"vector_hits": result.get("retrieval", {}).get("vector_hits", 0),
"selected_chunks": result.get("selected_chunks", 0),

Copilot AI Apr 3, 2026

run_retrieval_pipeline now returns a ContextPacket model, but this code still treats result like a dict via .get(...). This will raise at runtime. Use result.token_metrics.compressed_tokens, result.retrieval.fts_hits, result.selected_chunks, etc. (or call result.model_dump() once and work with the dict).

Suggested change — replace:
    compressed_tokens = result.get("token_metrics", {}).get("compressed_tokens", 0)
    selected_tokens = result.get("token_metrics", {}).get("selected_chunk_tokens", 0)
    compressed_tokens_total += compressed_tokens
    query_metrics.append({
        "query": q,
        "latency_ms": round(q_latency_ms, 2),
        "fts_hits": result.get("retrieval", {}).get("fts_hits", 0),
        "vector_hits": result.get("retrieval", {}).get("vector_hits", 0),
        "selected_chunks": result.get("selected_chunks", 0),
with:
    token_metrics = getattr(result, "token_metrics", None)
    retrieval = getattr(result, "retrieval", None)
    compressed_tokens = getattr(token_metrics, "compressed_tokens", 0)
    selected_tokens = getattr(token_metrics, "selected_chunk_tokens", 0)
    compressed_tokens_total += compressed_tokens
    query_metrics.append({
        "query": q,
        "latency_ms": round(q_latency_ms, 2),
        "fts_hits": getattr(retrieval, "fts_hits", 0),
        "vector_hits": getattr(retrieval, "vector_hits", 0),
        "selected_chunks": getattr(result, "selected_chunks", 0),

Comment on lines +535 to +537
with Progress(SpinnerColumn(), TextColumn("{task.description}"), console=err) as progress:
task = progress.add_task(f"Running {len(test_queries)} benchmark queries…")
for q in test_queries:

Copilot AI Apr 3, 2026

task is assigned but never used in this progress block, which will fail ruff lint (F841). Either remove the variable or use it (e.g., update progress while iterating).

Comment on lines +767 to +771
stored_mtime, stored_hash = indexed[source]
try:
current_mtime = path.stat().st_mtime
if stored_mtime is None or current_mtime > stored_mtime:
modified_files.append(path)

Copilot AI Apr 3, 2026

stored_hash is assigned but never used, which will fail ruff lint (F841). If hash-based change detection is planned, incorporate it into the modified-file decision; otherwise, drop the variable from the unpacking.

Comment on lines +824 to +835
- Stale session memory entries

Returns statistics about what was (or would be) cleaned.
"""
stats = {
"orphaned_chunks": 0,
"orphaned_embeddings": 0,
"expired_cache_entries": 0,
"stale_query_embeddings": 0,
"orphaned_symbols": 0,
"orphaned_dependencies": 0,
}

Copilot AI Apr 3, 2026

garbage_collect()'s docstring and stats include cleanup of "stale session memory entries" and stale_query_embeddings, plus a max_cache_age_seconds parameter, but none of these are implemented/used in the function body. Either implement the stale-cleanup logic (and use max_cache_age_seconds) or remove/adjust the docs + stats fields to match actual behavior.

Suggested change — replace:
    - Stale session memory entries

    Returns statistics about what was (or would be) cleaned.
    """
    stats = {
        "orphaned_chunks": 0,
        "orphaned_embeddings": 0,
        "expired_cache_entries": 0,
        "stale_query_embeddings": 0,
        "orphaned_symbols": 0,
        "orphaned_dependencies": 0,
    }
with:
    - Orphaned symbols
    - Orphaned dependencies

    Returns statistics about what was (or would be) cleaned.

    Note:
    - ``max_cache_age_seconds`` is accepted for API compatibility, but this
      function currently relies on the database's cache-expiration logic
      rather than using that value directly.
    """
    stats = {
        "orphaned_chunks": 0,
        "orphaned_embeddings": 0,
        "expired_cache_entries": 0,
        "orphaned_symbols": 0,
        "orphaned_dependencies": 0,
    }
    _ = max_cache_age_seconds

stored_mtime, stored_hash = indexed[source]
try:
current_mtime = path.stat().st_mtime
if stored_mtime is None or current_mtime > stored_mtime:

Copilot AI Apr 3, 2026

detect_file_changes() treats stored_mtime is None as modified, but file_mtime is never set during the normal index_corpus()/upsert_document() flow. This means the first sync after an index will likely re-index everything. Consider writing file_mtime (and/or file_hash) during indexing/upsert so unchanged files can be detected reliably.

Suggested change — replace:
    if stored_mtime is None or current_mtime > stored_mtime:
with:
    if stored_mtime is not None:
        if current_mtime > stored_mtime:
            modified_files.append(path)
    elif stored_hash:
        current_text = read_text_file(path)
        current_hash = hash_text(current_text)
        if current_hash != stored_hash:
            modified_files.append(path)
    else:
@Madhan230205 Madhan230205 merged commit 65cb3f8 into main Apr 3, 2026
1 of 3 checks passed