Conversation
Pull request overview
This PR introduces a hybrid “keywords-first” contract classification approach by generating a new semantic anchor registry (keywords per CCS category) and using a heuristic classifier with fallback to the existing LLM/RAG approach when confidence is low.
Changes:
- Added semantic-anchor extraction tooling (LLM-driven) and committed a new registry (semantic_anchors4.json) used by a heuristic classifier.
- Added heuristic classifier (v5) and "mix" wrappers to route between heuristic vs LLM/RAG classification.
- Expanded prompts and updated dependency management to a pip-compile-generated, fully pinned requirements.txt.
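The keywords-first routing described above can be sketched as follows. The names here (the heuristic callable, llm_classify, and the score threshold) are illustrative stand-ins for this overview, not the PR's actual API:

```python
# Minimal sketch of the "keywords-first, LLM fallback" routing.
# All names and the threshold value are illustrative, not the PR's code.

def classify_hybrid(description, heuristic, llm_classify, threshold=15):
    """Try the cheap keyword heuristic first; fall back to the LLM."""
    category, score = heuristic(description)
    if category is not None and score >= threshold:
        return category, "heuristic"
    return llm_classify(description), "llm"


# Toy stand-ins to show the control flow:
def toy_heuristic(text):
    return ("Cloud", 20) if "cloud" in text.lower() else (None, 0)

def toy_llm(text):
    return "Consultancy"  # pretend this is an LLM call

print(classify_hybrid("Cloud hosting framework", toy_heuristic, toy_llm))
print(classify_hybrid("Advisory services", toy_heuristic, toy_llm))
```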
Reviewed changes
Copilot reviewed 20 out of 24 changed files in this pull request and generated 18 comments.
Show a summary per file
| File | Description |
|---|---|
| src/utils/weight_keywords_tool_v3.py | Generates semantic anchors per category and writes semantic_anchors4.json |
| src/utils/RAG_system_v2.py | Builds/embeds category “rule cards” and uploads to Azure Search |
| src/utils/llm_keywords_finder.py | LLM keyword extraction helper used by the anchor generator |
| src/core/classification_v5.py | Heuristic keyword-based classifier over semantic_anchors4.json |
| src/core/classification_v3.py | RAG-backed mapper (Azure Search retrieval + LLM) |
| src/core/classification_v3_mix_v5.py | Hybrid wrapper combining v5 heuristic + v3 RAG fallback |
| src/core/classification_v2_mix_v5.py | Hybrid wrapper combining v5 heuristic + v2 LLM fallback |
| src/core/classification_v2.py | Minor import/path cleanup |
| src/core/classification_v1.py | Updated LangChain message imports + import path |
| semantic_anchors4.json | New semantic anchor registry used by v5 heuristic classifier |
| requirements.in | New pip-compile input set for dependencies |
| requirements.txt | New fully pinned dependency lockfile |
| prompts/keyword_system_prompt_v2.md | New keyword extraction prompt template |
| prompts/summariser_system_prompt.md | New rule-card summarisation prompt template |
| prompts/system_prompt_v3.md | New “rules-driven” classification prompt template |
| prompts/system_prompts.py | Prompt tweaks (incl. removing “Low Value” references) |
| prompts/system_prompt_v2.md | Removed “Low Value” category references |
| prompts/system_prompt.md | Removed “Low Value” category references |
| prompts/new_system_prompt.md | Removed “Low Value” category references |
| prompts/contractmap_prompt_with_descriptions.md | Removed “Low Value” category reference |
| investigate_model_accuracy.py | Local evaluation script for the new approach |
| async_investigate_model_accuracy.py | Async evaluation script for the hybrid approach |
| .gitignore | Ignore xlsx and other local experimentation artifacts |
```python
load_dotenv()


DEFAULT_SYSTEM_PROMPT_FILE = (
    Path(__file__).parent.parent.parent / "prompts/system_prompt_v6.md"
```

DEFAULT_SYSTEM_PROMPT_FILE points to prompts/system_prompt_v6.md, but that file does not exist in the repo (only up to system_prompt_v3.md). This will raise FileNotFoundError when contract_mapper_v3 runs. Update the path to an existing prompt file or add the missing prompt file to prompts/ (and keep naming consistent).

Suggested change:

```diff
-    Path(__file__).parent.parent.parent / "prompts/system_prompt_v6.md"
+    Path(__file__).parent.parent.parent / "prompts/system_prompt_v3.md"
```
```python
def keywords_and_llm(description):
    classifier = HeuristicClassifier()
    result, score = classifier.classify(description)

    if score >= 15:
```

keywords_and_llm decides based on score >= 15 only. When HeuristicClassifier.classify() returns (None, best_score) for ambiguous matches, this function will still return None (no fallback) as long as best_score >= 15. Gate on result is not None (and consider reusing a module-level classifier like classification_v2_mix_v5 does, to avoid reloading the JSON on every call).

Suggested change:

```diff
-def keywords_and_llm(description):
-    classifier = HeuristicClassifier()
-    result, score = classifier.classify(description)
-    if score >= 15:
+classifier = HeuristicClassifier()
+
+def keywords_and_llm(description):
+    result, score = classifier.classify(description)
+    if result is not None and score >= 15:
```
```python
# \b ensures we match exact words only
if re.search(rf"\b{re.escape(word.lower())}\b", desc_lower):
    if word.lower().startswith("rm"):
        score += 50  # Instant win
```

word.lower().startswith("rm") treats many non-framework tokens as RM codes (e.g., anchors like "RM Codes not present" / "RM Migration" in semantic_anchors4.json) and gives them an "instant win" +50 score. This will cause systematic misclassification. Consider only applying the +50 boost for true RM codes (e.g., ^rm\d{4}\b) and treating other "rm..." phrases as normal keywords.

Suggested change:

```diff
-# \b ensures we match exact words only
-if re.search(rf"\b{re.escape(word.lower())}\b", desc_lower):
-    if word.lower().startswith("rm"):
-        score += 50  # Instant win
+word_lower = word.lower()
+# \b ensures we match exact words only
+if re.search(rf"\b{re.escape(word_lower)}\b", desc_lower):
+    if re.fullmatch(r"rm\d{4}", word_lower):
+        score += 50  # Instant win for true RM codes only
```
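To illustrate the tightened pattern suggested in the comment above, a quick check (assuming CCS RM codes take the form "RM" followed by four digits; the helper name is hypothetical):

```python
import re

# Hypothetical helper: boost only genuine RM framework codes (e.g. RM6187),
# assuming CCS codes are "RM" followed by exactly four digits.
RM_CODE = re.compile(r"rm\d{4}", re.IGNORECASE)

def is_rm_code(token):
    # fullmatch requires the whole token to be a code, unlike startswith("rm")
    return RM_CODE.fullmatch(token) is not None

print(is_rm_code("RM6187"))        # True: genuine framework code
print(is_rm_code("RM Migration"))  # False: ordinary "rm..." phrase
print(is_rm_code("rm codes"))      # False
```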
```python
Primary match = 3 points
Secondary match = 1 point
```

The scoring described in the docstring ("Primary match = 3 points") doesn't match the implementation (primary is +5 and RM-prefixed is +50). Update the docstring (or adjust the scoring) so callers can tune threshold/margin correctly.

Suggested change:

```diff
-Primary match = 3 points
-Secondary match = 1 point
+Primary match = 5 points
+RM-prefixed primary match = 50 points
+Secondary match = 1 point
+
+The threshold and margin parameters should be tuned against this scoring scale.
```
```python
if (best_score - second_score) < margin:
    # We return None so the async loop falls back to the LLM
    print(f"Ambiguity detected ({best_cat} {best_score} vs {second_score}). Forcing LLM.")
    return None, best_score
```

classify() prints on ambiguity, which is noisy for library code and makes it hard to integrate in services/tests. Prefer returning an explicit status/reason or using the project's logging setup (and keep stdout clean by default).
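One way to do this, as a sketch: the ClassifyResult type and resolve helper below are hypothetical illustrations, not part of the PR.

```python
import logging
from dataclasses import dataclass
from typing import Optional

# Library code: attach no handler, so nothing reaches stdout by default.
logger = logging.getLogger(__name__)

@dataclass
class ClassifyResult:
    category: Optional[str]
    score: int
    reason: str  # e.g. "ok" or "ambiguous", so callers see *why* it's None

def resolve(best_cat, best_score, second_score, margin=5):
    """Return an explicit status instead of printing on ambiguity."""
    if (best_score - second_score) < margin:
        logger.debug("Ambiguity: %s %s vs %s", best_cat, best_score, second_score)
        return ClassifyResult(None, best_score, "ambiguous")
    return ClassifyResult(best_cat, best_score, "ok")
```

The caller can then fall back to the LLM whenever reason != "ok", and tests can assert on the reason field rather than capturing stdout.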
```python
registry[category] = {"primary": list(primary_store), "secondary": list(secondary_store)}
```

registry starts with a dict shape ({"primary": ..., "secondary": ...}) but is later mutated to a list shape ([final_primary, final_secondary]). This shape-shifting makes the code fragile and harder to extend. Consider keeping a consistent structure throughout (e.g., always dicts or a small typed object) and only transforming once at the final serialization step.
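A minimal sketch of that approach, with illustrative names: keep one dict-of-sets shape while building, and convert to lists in a single place when writing JSON.

```python
import json

def new_entry():
    # One canonical in-memory shape: dict of sets (cheap dedup while building).
    return {"primary": set(), "secondary": set()}

def serialize(registry):
    """Single transformation point: sets -> sorted lists for JSON output."""
    return {
        cat: {"primary": sorted(entry["primary"]),
              "secondary": sorted(entry["secondary"])}
        for cat, entry in registry.items()
    }

registry = {"Cloud": new_entry()}
registry["Cloud"]["primary"].update({"hosting", "iaas"})
registry["Cloud"]["secondary"].add("compute")
print(json.dumps(serialize(registry), indent=2))
```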
```diff
@@ -0,0 +1,31 @@
+# Base Tools (Add these!)
+pip
```

Including pip in requirements.in is unusual (pip-compile treats it as "unsafe" for requirements files). Unless you have a specific reason to pin/upgrade pip via your runtime dependencies, it's safer to remove it and manage the pip version via your build image/tooling instead.

Suggested change:

```diff
-pip
```
```diff
 import os
 from pathlib import Path
-from utils.file_to_string import file_to_string_processor
+from src.utils.file_to_string import file_to_string_processor
```

These modules now import src.utils..., but the packaging/test setup treats src/ as the source root (see pyproject.toml [tool.setuptools.packages.find] where = ["src"] and pytest pythonpath = ["src"]), which makes the install-time packages utils, core, etc. In an installed environment (or when only src/ is on PYTHONPATH), src.utils may not exist. Prefer importing utils.file_to_string (consistent with tests/test_file_to_string.py) or update the packaging layout so src.* is a real distributable package.

Suggested change:

```diff
-from src.utils.file_to_string import file_to_string_processor
+from utils.file_to_string import file_to_string_processor
```
```diff
 from dotenv import load_dotenv
 import os
-from utils.file_to_string import file_to_string_processor
+from src.utils.file_to_string import file_to_string_processor
```

This file now imports src.utils..., but the repo's packaging/test setup uses src/ as the source root (so packages resolve as utils, core, etc.). In environments where only src/ is on PYTHONPATH (common for tests/packaging), src.utils may not be importable. Prefer from utils.file_to_string import ... (as in tests/test_file_to_string.py) or adjust the packaging layout to make src.* imports valid after installation.

Suggested change:

```diff
-from src.utils.file_to_string import file_to_string_processor
+from utils.file_to_string import file_to_string_processor
```
```python
for text in group['clean_description'].tolist():
    # Using the LLM to find keywords for each individual contract
    primary, secondary = keywords_finder_llm(text)
    primary_store.update(primary)
    secondary_store.update(secondary)
```

keywords_finder_llm() is called on each individual contract description (text) here, but the prompt template (keyword_system_prompt_v2.md) is written to analyze a category-level collection of framework titles/summaries for {{category_name}}. This mismatch makes the extracted anchors less reliable and also leaves {{category_name}} unresolved unless you pass it in. Consider either (a) batching/concatenating per-category text and passing category_name, or (b) rewriting the prompt/template to match per-contract extraction.
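Option (a) could look roughly like this. The extract_category_keywords helper and the (text, category_name) signature for the finder are assumptions for illustration, not the PR's actual interface:

```python
# Sketch of batching per-category text for one extractor call, passing
# category_name so a {{category_name}} placeholder can be resolved.
# keywords_finder_llm's real signature may differ from this assumption.

def extract_category_keywords(group_texts, category_name, finder):
    # Concatenate the per-contract descriptions into one category-level document.
    batched = "\n\n".join(group_texts)
    return finder(batched, category_name)


# Toy finder standing in for the LLM call, to show the call shape:
def toy_finder(text, category_name):
    return ({category_name.lower()}, {"framework"})

primary, secondary = extract_category_keywords(
    ["Cloud hosting services", "IaaS compute framework"],
    "Cloud Compute",
    toy_finder,
)
```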
I created a new contract map model that uses keywords generated from the framework/contracts data per CCS category, but switches to the original LLM contract map if there are not enough keyword matches to prove that a description belongs to the category. The results are below:

--- Test Results ---
Total Contract Descriptions Analysed: 781
New model (Contract Map Version 3) Correct Matches: 639
New model (Contract Map Version 3) Accuracy: 81.82%
Old model (Contract Map Version 2) Correct Matches: 621
Old model (Contract Map Version 2) Accuracy: 79.51%