Skip to content

two ADRs for a streamlined evaluation process#168

Open
omri374 with Copilot wants to merge 22 commits into
masterfrom
copilot/add-simplified-evaluation-pipeline
Open

two ADRs for a streamlined evaluation process#168
omri374 with Copilot wants to merge 22 commits into
masterfrom
copilot/add-simplified-evaluation-pipeline

Conversation

Copilot AI commented Mar 21, 2026

Copy link
Copy Markdown
Contributor

Two ADRs to simplify the evaluation process and support other applications that use it.

Copilot AI changed the title [WIP] Add ADR for simplified evaluation pipeline using Pandas Add ADR-001: Simplified Evaluation Pipeline via DataFrame Interface Mar 21, 2026
Copilot AI requested a review from omri374 March 21, 2026 14:10
Updated the evaluation pipeline to simplify the model prediction process and entity mapping. Removed redundant code and clarified comments for better understanding.
@omri374 omri374 changed the title Add ADR-001: Simplified Evaluation Pipeline via DataFrame Interface Add two ADRs for simplifying the evaluation process Mar 21, 2026
Copilot AI changed the title Add two ADRs for simplifying the evaluation process docs: Add ADR-001 for simplified evaluation pipeline via DataFrame interface Mar 21, 2026
Copilot AI changed the title docs: Add ADR-001 for simplified evaluation pipeline via DataFrame interface docs: ADR-002 — Entity Mapping via CanonicalMapper Mar 21, 2026
@omri374 omri374 changed the title docs: ADR-002 — Entity Mapping via CanonicalMapper two ADRs for a streamlined evaluation process Mar 21, 2026

@omri374 omri374 left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@copilot please fix the ADR according to the comments.

Comment thread docs/adr/ADR-002-entity-mapping.md Outdated
Comment thread docs/adr/ADR-002-entity-mapping.md Outdated
Comment thread docs/adr/ADR-002-entity-mapping.md Outdated
Comment thread docs/adr/ADR-002-entity-mapping.md Outdated
Comment thread docs/adr/ADR-002-entity-mapping.md
Comment thread docs/adr/ADR-002-entity-mapping.md Outdated
Comment thread docs/adr/ADR-002-entity-mapping.md Outdated
Comment thread docs/adr/ADR-002-entity-mapping.md Outdated
Comment thread docs/adr/ADR-002-entity-mapping.md Outdated
Comment thread docs/adr/ADR-002-entity-mapping.md Outdated

Copilot AI commented Mar 21, 2026

Copy link
Copy Markdown
Contributor Author

@copilot please fix the ADR according to the comments.

Done in 3e900aa. Here's a summary of the changes made to ADR-002:

  • Context: Reduced to two focused pain points. The first now leads with the core problem — that cross-model comparison requires mapping but the current approach is too simplistic, causing biased results. Removed the SemanticEntityMapper, interactive fallback, and audit trail pain points (not present in main / specific to new implementation).
  • Decision: Added an explanation of what a canonical entity is and what "mapping to a canonical entity" means.
  • Consequences: Removed "Lighter dependency footprint", "Breaking change for SemanticEntityMapper users", and "Stateful object" bullets (not applicable to the current state).
  • Alternatives Considered: Replaced the single "Keep SemanticEntityMapper" entry with a comprehensive numbered list: (1) Score against own labels, (2) Manual mapping only, (3) Semantic similarity/embedding-based matching, (4) Stateless mapping function, (5) Embed mapping inside model prediction.

Copilot AI requested a review from omri374 March 21, 2026 14:52
omri374 added 3 commits March 21, 2026 17:06
Added a proposed hierarchical entity mapping dictionary with detailed classifications for various entity types and their corresponding attributes.
Clarified the hierarchical entity mapping dictionary with specific examples for 2nd and 3rd levels.
Removed interactive resolution and clarified mapping retrieval process.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds two proposed Architecture Decision Records (ADRs) describing a future streamlined evaluation pipeline and a new canonical entity-mapping approach for presidio_evaluator.

Changes:

  • Introduces ADR-001 proposing a DataFrame-first evaluation interface boundary between models and evaluators.
  • Introduces ADR-002 proposing a CanonicalMapper-based entity mapping/resolution pipeline plus a draft hierarchy/alias dictionary.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 13 comments.

File Description
docs/adr/ADR-001-simplified-evaluation-pipeline.md Documents a proposed DataFrame-centric evaluation pipeline and migration plan.
docs/adr/ADR-002-entity-mapping.md Documents a proposed canonical entity mapping approach and provides a draft hierarchy/alias set.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread docs/adr/ADR-002-entity-mapping.md Outdated
Comment thread docs/adr/ADR-002-entity-mapping.md Outdated
Comment thread docs/adr/ADR-002-entity-mapping.md Outdated
Comment thread docs/adr/ADR-001-simplified-evaluation-pipeline.md Outdated
Comment thread docs/adr/ADR-001-simplified-evaluation-pipeline.md Outdated
Comment thread docs/adr/ADR-002-entity-mapping.md Outdated
Comment thread docs/adr/ADR-002-entity-mapping.md
Comment thread docs/adr/ADR-001-simplified-evaluation-pipeline.md Outdated
Comment thread docs/adr/ADR-001-simplified-evaluation-pipeline.md Outdated
Comment thread docs/adr/ADR-001-simplified-evaluation-pipeline.md Outdated
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
omri374 and others added 2 commits March 22, 2026 09:25
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Copilot AI and others added 2 commits March 22, 2026 07:42
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@omri374 omri374 marked this pull request as ready for review March 22, 2026 19:58
@omri374

omri374 commented Mar 22, 2026

Copy link
Copy Markdown
Collaborator

@RonShakutai @negruber1 I created two ADRs that I think will greatly simplify the evaluation process. Would appreciate your honest feedback if that's the right path to go. We can also consider an alternative of writing things from scratch with a new interface (simliar to other evaluation frameworks we know)

Added example mappings of raw entity labels to canonical entities in the EntityHierarchy, enhancing clarity on how different labels correspond to standardized entities.
Comment thread docs/adr/ADR-001-simplified-evaluation-pipeline.md
Comment thread docs/adr/ADR-001-simplified-evaluation-pipeline.md
Comment thread docs/adr/ADR-001-simplified-evaluation-pipeline.md
Comment thread docs/adr/ADR-001-simplified-evaluation-pipeline.md

3. **Make `model` optional in `BaseEvaluator`** — change `BaseEvaluator.__init__(self, model=None, ...)` so that `model` defaults to `None`, relying on the existing runtime check in `evaluate_all()` that raises a clear error when `model is None`.

4. **Update `evaluate_all()` to delegate to `predict_dataset` + `calculate_score_on_df`** — refactor `SpanEvaluator.evaluate_all()` and `TokenEvaluator.evaluate_all()` to call `self.model.predict_dataset(dataset)` and then pass the result to `calculate_score_on_df()`. This ensures a single code path for both old and new usage.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't evaluate_all() only part of BaseEvaluator?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes it is. Good catch. I'll update this comment, but the principle is the same- we try to keep the existing interface for backward compatibility but it would essentially call the new flow.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So technically evaluate_all() returns the dataframe?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure, maybe we can just deprecate it. WDYT?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we call predict_dataset and then calculate_score_on_df, wouldn't evaluate_all be redundant? If we keep it for backward compatibility, I think we should leave it as is (return EvaluationResult)

Comment thread docs/adr/ADR-002-entity-mapping.md

1. **Add `BaseModel.predict_dataset()`** — implement the method as sketched above in `presidio_evaluator/models/base_model.py`. Add a unit test in `tests/` that verifies the 5-column schema and correct row count for a small synthetic dataset.

2. **Add `map_entities()` utility** — add the function (and `Dict` import) to `presidio_evaluator/evaluation/` (e.g., in a new `utils.py` or alongside `get_results_dataframe`). Add a unit test verifying that both `annotation` and `prediction` columns are remapped.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should add the mapping to results dataframe. If we allow users to define their own mapping, I think that we need to make sure that both predicted and annotated entities are mapped to the same canonical mapping (if they really are a match).

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes that's what I was thinking too. Essentially the only place where we change entities to their canonical form is this dataframe. This would simplify the flow today which does mapping in different parts of the flow.

in any case, the entity mapper generates one mapping for both the annotation and prediction, so they would have to be mapped the same way

@omri374 omri374 Mar 24, 2026

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How does this look?

# 1. Load dataset
dataset = InputSample.read_dataset_json("data/dataset.json")

# 2. Choose model and run predictions → get DataFrame directly
model = PresidioAnalyzerWrapper(analyzer_engine=AnalyzerEngine())
results_df = model.predict_dataset(dataset)  # NEW: returns the DataFrame directly

# 3. Map entities (transforms both predictions and annotations into canonical entities)
mapper = CanonicalMapper()

# 4. Map to hierarchy (PII, High level, canonical, specific) and evaluate
evaluator = SpanEvaluator()
results_per_hierarchy = []
for hierarchy in [1,2,3]):
    results_df_hierarchy = mapper.map_entities(results_df, hierarchy=hierarchy)
    results_per_hierarchy = evaluator.calculate_score_on_df(results_df=results_df_hierarchy)

# 5. Analyze/plot
plotter = Plotter(results=results_per_hierarchy[0])
plotter.plot_scores()

@omri374 omri374 Mar 24, 2026

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alternatively, one can just use the default (hierarchy=3) and run the experiment, which is simpler:

# 1. Load dataset
dataset = InputSample.read_dataset_json("data/dataset.json")

# 2. Choose model and run predictions
model = PresidioAnalyzerWrapper(analyzer_engine=AnalyzerEngine())
results_df = model.predict_dataset(dataset)

# 3. Map entities (transforms both predictions and annotations into canonical entities)
mapper = CanonicalMapper()

# 4. Map to hierarchy (PII, High level, canonical, specific) and evaluate
evaluator = SpanEvaluator()

results_df_mapped = mapper.map_entities(results_df)
results = evaluator.calculate_score_on_df(results_df=results_df_mapped)

# 5. Analyze/plot
plotter = Plotter(results=results)
plotter.plot_scores()

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me :) BTW, what if a label is too general for the chosen canonical depth? Should we handle this case?

@omri374 omri374 Mar 24, 2026

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good question! So if the user has a model with ["PERSON", "LOCATION"] and the dataset has ["STREET_ADDRESS", "NAME"], how should we map the two? PERSON and LOCATION are level 2, but we want to map to level 3. In this case, maybe if one of the entities is level 2, we should map everything to level 2? Or is this too naive?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alternatively, we can choose one level 3 entity that a level 2 entity would be mapped to (like "NAME" for "PERSON" or "ADDRESS" to "LOCATION")

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this case, where the model's and dataset's levels of depth are different, should both mappers be coordinated when deciding on the depth? Assuming we have a mapper for the model and another one for the dataset and there is some auto-downgrade done. Also, what if the model includes both level 2 and level 3 entities? Does it make sense to downgrade per category only where there's a depth mismatch? It might complicate things

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the proposed approach here is to have one mapper for everything. If there level 2 entities + level 3 entities, we would go with level 2. Another option is to just change the branch that has level 2 entities and keep everything else at level 3. So if the dataset has PERSON (level 2), ADDRESS (level 3) and some other level 3, and the model has FIRST_NAME, PREFIX, LAST_NAME, (level 3 under PERSON), then these would be mapped to PERSON while the other level 3 entities under other branches (like the LOCATION branch) would be mapped to level 3.
Sorry... this is becoming too complicated :)

omri374 added 6 commits March 24, 2026 10:58
Added details about EvaluationResult and Error Analysis.
Refactor evaluation pipeline to include entity mapping and scoring per hierarchy.
Updated the prediction method to return a DataFrame directly.
Added input format details for token comparisons and updated usage examples for CanonicalMapper.
Added example code for multi-hierarchical evaluations in the ADR document.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants