## Review Summary by Qodo

**Migrate to typer CLI, add Pydantic models, and implement context lifecycle management**

### Walkthrough

**Description**

- Migrate CLI from argparse to typer for modern async support
- Add Pydantic models for type-safe data structures
- Implement context lifecycle management commands (sync, gc, stats)
- Add comprehensive test suite for chunker, db, and embeddings modules
- Add project configuration and CI/CD workflows

**Diagram**

```mermaid
flowchart LR
    argparse["argparse CLI"] -->|migrate| typer["typer Framework"]
    dataclass["dataclass Models"] -->|convert| pydantic["Pydantic BaseModel"]
    typer -->|new commands| sync["sync Command"]
    typer -->|new commands| gc["gc Command"]
    typer -->|new commands| stats["stats Command"]
    db["db Module"] -->|add functions| lifecycle["Lifecycle Management"]
    lifecycle -->|detect| changes["File Changes"]
    lifecycle -->|cleanup| garbage["Garbage Collection"]
    compressor["compressor Module"] -->|add| floor["Relevance Floor"]
    floor -->|filter| quality["Low-Quality Chunks"]
    tests["Test Suite"] -->|cover| modules["chunker, db, embeddings"]
    config["pyproject.toml"] -->|enable| packaging["PyPI Distribution"]
    ci["CI/CD Workflows"] -->|automate| checks["Lint, Test, Publish"]
```
### File Changes

1. `scripts/token_reducer/cli.py`
## Code Review by Qodo

### 1. Benchmark result type crash
**CI Feedback 🧐** A test triggered by this PR failed. Here is an AI-generated analysis of the failure:
```python
compressed_tokens = result.get("token_metrics", {}).get("compressed_tokens", 0)
selected_tokens = result.get("token_metrics", {}).get("selected_chunk_tokens", 0)
compressed_tokens_total += compressed_tokens
query_metrics.append({
    "query": q,
    "latency_ms": round(q_latency_ms, 2),
    "fts_hits": result.get("retrieval", {}).get("fts_hits", 0),
    "vector_hits": result.get("retrieval", {}).get("vector_hits", 0),
    "selected_chunks": result.get("selected_chunks", 0),
    "selected_tokens": selected_tokens,
```
**1. Benchmark result type crash** 🐞 Bug · Correctness
cli.benchmark() calls result.get(...) on the return value of run_retrieval_pipeline(), but that function now returns a ContextPacket Pydantic model. The benchmark command will raise AttributeError and abort on its first query.
Agent Prompt
## Issue description
`benchmark()` assumes `run_retrieval_pipeline()` returns a dict and uses `result.get(...)`, but it now returns a `ContextPacket` model. This crashes benchmarking with `AttributeError`.
## Issue Context
`run_retrieval_pipeline(...) -> ContextPacket`, and other CLI commands (e.g., `query`, `run`) already access `result.packet` / `result.model_dump()`.
## Fix Focus Areas
- scripts/token_reducer/cli.py[535-569]
## What to change
- Replace dict-style access with model access, e.g.:
- `compressed_tokens = result.token_metrics.compressed_tokens`
- `selected_tokens = result.token_metrics.selected_chunk_tokens`
- `fts_hits = result.retrieval.fts_hits`
- `vector_hits = result.retrieval.vector_hits`
- `selected_chunks = result.selected_chunks`
- Alternatively: `r = result.model_dump()` once per query and keep the existing dict-based extraction from `r`.
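The attribute-based fix can be sketched with minimal stand-ins for the models. The real Pydantic classes live in `scripts/token_reducer/models.py`; the field names below are taken from the fix list above, and everything else (the stand-in classes and `extract_benchmark_metrics` helper) is illustrative:

```python
from dataclasses import dataclass

# Minimal stand-ins for the repo's Pydantic models; field names follow the
# review's fix list, the classes themselves are illustrative.
@dataclass
class TokenMetrics:
    compressed_tokens: int
    selected_chunk_tokens: int

@dataclass
class RetrievalInfo:
    fts_hits: int
    vector_hits: int

@dataclass
class ContextPacket:
    token_metrics: TokenMetrics
    retrieval: RetrievalInfo
    selected_chunks: int

def extract_benchmark_metrics(result: ContextPacket) -> dict:
    """Attribute access replacing the old dict-style result.get(...) calls."""
    return {
        "compressed_tokens": result.token_metrics.compressed_tokens,
        "selected_tokens": result.token_metrics.selected_chunk_tokens,
        "fts_hits": result.retrieval.fts_hits,
        "vector_hits": result.retrieval.vector_hits,
        "selected_chunks": result.selected_chunks,
    }

packet = ContextPacket(TokenMetrics(120, 340), RetrievalInfo(5, 7), 4)
metrics = extract_benchmark_metrics(packet)
```

With a real Pydantic model, calling `result.model_dump()` once per query would equally allow keeping the existing dict-based extraction unchanged.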
```python
# Find deleted files (in index but not in provided file list)
current_sources = {str(p) for p in file_paths}
deleted_sources = [s for s in indexed_sources if s not in current_sources]

return new_files, modified_files, deleted_sources
```
**2. Sync deletes unrelated documents** 🐞 Bug · Correctness
detect_file_changes() marks any indexed source not present in the current expanded --inputs list as “deleted”, and sync() removes those documents immediately. Syncing a subset of paths can therefore wipe unrelated indexed documents/chunks/embeddings from the DB.
Agent Prompt
## Issue description
`sync` can delete indexed documents that are simply not included in the current `--inputs` set (not actually deleted from disk). This is destructive and surprising when users sync only a subset of their project.
## Issue Context
`detect_file_changes()` currently computes `deleted_sources` as `indexed_sources - current_sources`, where `current_sources` is derived only from the provided `file_paths` argument.
## Fix Focus Areas
- scripts/token_reducer/db.py[747-780]
- scripts/token_reducer/cli.py[655-690]
## What to change
Choose one of these safe semantics:
1) **Filesystem-based deletion** (recommended default):
- Compute deletions by checking whether each indexed `source` still exists on disk (`Path(source).exists()`), not by absence from the current input list.
- Optionally restrict this check to sources under the provided inputs’ directories.
2) **Opt-in pruning**:
- Add a `--prune-missing/--no-prune` flag (default `False`).
- Only call `remove_documents_by_source` when pruning is enabled.
3) **Persist sync scope**:
- Persist the root inputs used to build the DB and only prune within that scope.
Also update help text to match the chosen behavior.
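Options 1 and 2 combined can be sketched as follows; the function name and flag are illustrative, not the actual CLI API. A source only counts as deleted when the file is actually missing from disk, and pruning is gated behind an explicit opt-in:

```python
from pathlib import Path

def find_deleted_sources(indexed_sources: list[str],
                         prune_missing: bool = False) -> list[str]:
    # Only treat a source as deleted when the file is actually gone from
    # disk, and only when the caller explicitly opted into pruning.
    # Absence from the current --inputs set is no longer sufficient.
    if not prune_missing:
        return []
    return [s for s in indexed_sources if not Path(s).exists()]
```

With this semantics, syncing a subset of a project's paths leaves documents outside that subset untouched unless the user passes the pruning flag.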
## Pull request overview
This PR refactors the token-reducer pipeline toward a more structured, “package-like” architecture: migrating key payloads to Pydantic models, switching the CLI to Typer/Rich, expanding retrieval defaults, and adding index lifecycle utilities (sync/gc/stats) plus a new test suite.
Changes:
- Introduce Pydantic models for retrieval candidates and the pipeline's `ContextPacket`, updating pipeline/compressor code paths accordingly.
- Replace the argparse CLI with a Typer-based CLI and add lifecycle commands (`sync`, `gc`, `stats`) and improved UX (progress output).
- Add comprehensive pytest coverage for chunking, embeddings, and DB behavior; adjust retrieval defaults (e.g., `DEFAULT_TOP_K=50`) and add a relevance floor for compression.
### Reviewed changes
Copilot reviewed 15 out of 16 changed files in this pull request and generated 9 comments.
| File | Description |
|---|---|
| `scripts/token_reducer/models.py` | Replace dataclass usage with Pydantic models (ContextPacket and related types). |
| `scripts/token_reducer/pipeline.py` | Return ContextPacket, add relevance-floor support, and adapt cache handling. |
| `scripts/token_reducer/compressor.py` | Add relevance floor to compression and return a typed ContextPacket from build_packet. |
| `scripts/token_reducer/cli.py` | Migrate to Typer/Rich; add lifecycle commands; update output handling for model-based results. |
| `scripts/token_reducer/db.py` | Extend schema for mtime/hash and add lifecycle helpers (stats/sync/gc utilities). |
| `scripts/token_reducer/retriever.py` | Remove top-k bounding to pass full candidate pool downstream. |
| `scripts/token_reducer/config.py` | Increase default top_k and introduce DEFAULT_RELEVANCE_FLOOR. |
| `tests/test_chunker.py` | Add unit tests for chunking/tokenization/import parsing utilities. |
| `tests/test_db.py` | Add unit tests for DB schema, caching, session memory, indexing behavior. |
| `tests/test_embeddings.py` | Add unit tests for embedding backends and cosine similarity behavior. |
| `scripts/token_reducer/chunker.py` | Minor import-resolution refactor. |
| `pyproject.toml` | Add packaging + tooling configuration (hatch, ruff, mypy, pytest). |
| `Makefile` | Add common dev commands (test/lint/format/typecheck/check). |
| `.github/workflows/ci.yml` | Add CI for ruff + pytest across Python 3.11/3.12. |
| `.github/workflows/publish.yml` | Add tag-based PyPI publishing workflow. |
```python
from datetime import datetime, timezone
from hashlib import blake2b
from pathlib import Path
from typing import Any
```
Any is imported but never used, which will fail the repo's ruff lint step (F401). Remove the import or use it in a type annotation.
```diff
-from typing import Any
```
```python
    CacheInfo,
    ContextPacket,
    RetrievalInfo,
    SessionMemory,
```
CacheInfo and SessionMemory are imported but not used anywhere in this module, which will fail ruff lint (F401). Please drop these unused imports (or reference them explicitly if intended for typing).
```diff
-    CacheInfo,
     ContextPacket,
     RetrievalInfo,
-    SessionMemory,
```
```diff
     word_budget: int,
-) -> dict:
+    relevance_floor: float = DEFAULT_RELEVANCE_FLOOR,
+) -> ContextPacket:
```
relevance_floor is now an input to run_retrieval_pipeline, but it is not included in the query-cache key material. If callers pass a non-default relevance_floor, cached packets can be reused across different floors and produce inconsistent results. Add relevance_floor to cache_key_material.
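A sketch of what including the floor in the key material could look like; the helper name and field layout are assumptions, and the repo's actual key builder may hash different material:

```python
from hashlib import blake2b

def cache_key(query: str, top_k: int, word_budget: int,
              relevance_floor: float) -> str:
    # Every parameter that can change the packet's contents must be part
    # of the key material, including relevance_floor; otherwise packets
    # cached under one floor are silently reused under another.
    material = f"{query}|{top_k}|{word_budget}|{relevance_floor}"
    return blake2b(material.encode("utf-8"), digest_size=16).hexdigest()

key_default = cache_key("how does auth work", 50, 2000, 0.0)
key_custom = cache_key("how does auth work", 50, 2000, 0.35)
```

Two calls that differ only in `relevance_floor` now map to distinct cache entries.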
```python
# Reconstruct ContextPacket from cached dict
cached_packet = ContextPacket.model_validate(cached_result)
cached_packet.session_memory = SessionMemory(
```
ContextPacket.model_validate(cached_result) can raise a ValidationError if the on-disk cache payload is from an older schema or is corrupted. Right now that exception will abort the pipeline on an otherwise recoverable cache hit. Wrap validation in a try/except and treat invalid cache entries as a miss (delete + recompute).
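The recovery path can be sketched generically; here `validate` stands in for `ContextPacket.model_validate` and `drop_entry` for deleting the stale cache row, both placeholders rather than the repo's actual functions:

```python
def packet_from_cache(cached_result, validate, drop_entry):
    """Treat an unparseable cached payload as a cache miss: drop the stale
    entry and let the caller recompute, instead of aborting the pipeline."""
    try:
        return validate(cached_result)
    except Exception:  # pydantic.ValidationError in the real code path
        drop_entry()
        return None

def failing_validate(payload):
    # Simulates a cached payload written by an older schema.
    raise ValueError("cached payload predates the current schema")

dropped = []
packet = packet_from_cache({"old": "schema"}, failing_validate,
                           lambda: dropped.append(True))
```

A `None` return signals the caller to fall through to the normal compute-and-cache path.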
```python
compressed_tokens = result.get("token_metrics", {}).get("compressed_tokens", 0)
selected_tokens = result.get("token_metrics", {}).get("selected_chunk_tokens", 0)
compressed_tokens_total += compressed_tokens
query_metrics.append({
    "query": q,
    "latency_ms": round(q_latency_ms, 2),
    "fts_hits": result.get("retrieval", {}).get("fts_hits", 0),
    "vector_hits": result.get("retrieval", {}).get("vector_hits", 0),
    "selected_chunks": result.get("selected_chunks", 0),
```
run_retrieval_pipeline now returns a ContextPacket model, but this code still treats result like a dict via .get(...). This will raise at runtime. Use result.token_metrics.compressed_tokens, result.retrieval.fts_hits, result.selected_chunks, etc. (or call result.model_dump() once and work with the dict).
```diff
-compressed_tokens = result.get("token_metrics", {}).get("compressed_tokens", 0)
-selected_tokens = result.get("token_metrics", {}).get("selected_chunk_tokens", 0)
-compressed_tokens_total += compressed_tokens
-query_metrics.append({
-    "query": q,
-    "latency_ms": round(q_latency_ms, 2),
-    "fts_hits": result.get("retrieval", {}).get("fts_hits", 0),
-    "vector_hits": result.get("retrieval", {}).get("vector_hits", 0),
-    "selected_chunks": result.get("selected_chunks", 0),
+token_metrics = getattr(result, "token_metrics", None)
+retrieval = getattr(result, "retrieval", None)
+compressed_tokens = getattr(token_metrics, "compressed_tokens", 0)
+selected_tokens = getattr(token_metrics, "selected_chunk_tokens", 0)
+compressed_tokens_total += compressed_tokens
+query_metrics.append({
+    "query": q,
+    "latency_ms": round(q_latency_ms, 2),
+    "fts_hits": getattr(retrieval, "fts_hits", 0),
+    "vector_hits": getattr(retrieval, "vector_hits", 0),
+    "selected_chunks": getattr(result, "selected_chunks", 0),
```
```python
with Progress(SpinnerColumn(), TextColumn("{task.description}"), console=err) as progress:
    task = progress.add_task(f"Running {len(test_queries)} benchmark queries…")
    for q in test_queries:
```
task is assigned but never used in this progress block, which will fail ruff lint (F841). Either remove the variable or use it (e.g., update progress while iterating).
```python
stored_mtime, stored_hash = indexed[source]
try:
    current_mtime = path.stat().st_mtime
    if stored_mtime is None or current_mtime > stored_mtime:
        modified_files.append(path)
```
stored_hash is assigned but never used, which will fail ruff lint (F841). If hash-based change detection is planned, incorporate it into the modified-file decision; otherwise, drop the variable from the unpacking.
```python
    - Stale session memory entries

    Returns statistics about what was (or would be) cleaned.
    """
    stats = {
        "orphaned_chunks": 0,
        "orphaned_embeddings": 0,
        "expired_cache_entries": 0,
        "stale_query_embeddings": 0,
        "orphaned_symbols": 0,
        "orphaned_dependencies": 0,
    }
```
garbage_collect()'s docstring and stats include cleanup of "stale session memory entries" and stale_query_embeddings, plus a max_cache_age_seconds parameter, but none of these are implemented/used in the function body. Either implement the stale-cleanup logic (and use max_cache_age_seconds) or remove/adjust the docs + stats fields to match actual behavior.
```diff
-    - Stale session memory entries
-
-    Returns statistics about what was (or would be) cleaned.
-    """
-    stats = {
-        "orphaned_chunks": 0,
-        "orphaned_embeddings": 0,
-        "expired_cache_entries": 0,
-        "stale_query_embeddings": 0,
-        "orphaned_symbols": 0,
-        "orphaned_dependencies": 0,
-    }
+    - Orphaned symbols
+    - Orphaned dependencies
+
+    Returns statistics about what was (or would be) cleaned.
+
+    Note:
+    - ``max_cache_age_seconds`` is accepted for API compatibility, but this
+      function currently relies on the database's cache-expiration logic
+      rather than using that value directly.
+    """
+    stats = {
+        "orphaned_chunks": 0,
+        "orphaned_embeddings": 0,
+        "expired_cache_entries": 0,
+        "orphaned_symbols": 0,
+        "orphaned_dependencies": 0,
+    }
+    _ = max_cache_age_seconds
```
```python
stored_mtime, stored_hash = indexed[source]
try:
    current_mtime = path.stat().st_mtime
    if stored_mtime is None or current_mtime > stored_mtime:
```
detect_file_changes() treats stored_mtime is None as modified, but file_mtime is never set during the normal index_corpus()/upsert_document() flow. This means the first sync after an index will likely re-index everything. Consider writing file_mtime (and/or file_hash) during indexing/upsert so unchanged files can be detected reliably.
```diff
-        if stored_mtime is None or current_mtime > stored_mtime:
+        if stored_mtime is not None:
+            if current_mtime > stored_mtime:
+                modified_files.append(path)
+        elif stored_hash:
+            current_text = read_text_file(path)
+            current_hash = hash_text(current_text)
+            if current_hash != stored_hash:
+                modified_files.append(path)
+        else:
```