two ADRs for a streamlined evaluation process #168
…Interface Co-authored-by: omri374 <3776619+omri374@users.noreply.github.com> Agent-Logs-Url: https://github.com/microsoft/presidio-research/sessions/eb50120c-0b18-41ca-9db2-55c071434576
Updated the evaluation pipeline to simplify the model prediction process and entity mapping. Removed redundant code and clarified comments for better understanding.
Co-authored-by: omri374 <3776619+omri374@users.noreply.github.com> Agent-Logs-Url: https://github.com/microsoft/presidio-research/sessions/f2aad060-8eac-4214-aba8-01de85cf7536
Co-authored-by: omri374 <3776619+omri374@users.noreply.github.com> Agent-Logs-Url: https://github.com/microsoft/presidio-research/sessions/7e819255-deb4-44b4-88fe-6ef94a3afcdc
Done in 3e900aa. Here's a summary of the changes made to ADR-002:
Added a proposed hierarchical entity mapping dictionary with detailed classifications for various entity types and their corresponding attributes.
Clarified the hierarchical entity mapping dictionary with specific examples for 2nd and 3rd levels.
Removed interactive resolution and clarified mapping retrieval process.
Pull request overview
Adds two proposed Architecture Decision Records (ADRs) describing a future streamlined evaluation pipeline and a new canonical entity-mapping approach for presidio_evaluator.
Changes:
- Introduces ADR-001 proposing a DataFrame-first evaluation interface boundary between models and evaluators.
- Introduces ADR-002 proposing a `CanonicalMapper`-based entity mapping/resolution pipeline plus a draft hierarchy/alias dictionary.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 13 comments.
| File | Description |
|---|---|
| docs/adr/ADR-001-simplified-evaluation-pipeline.md | Documents a proposed DataFrame-centric evaluation pipeline and migration plan. |
| docs/adr/ADR-002-entity-mapping.md | Documents a proposed canonical entity mapping approach and provides a draft hierarchy/alias set. |
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: omri374 <3776619+omri374@users.noreply.github.com> Agent-Logs-Url: https://github.com/microsoft/presidio-research/sessions/0cb7a127-de0a-4167-a2cb-a4b98a8f4028
…w feedback Co-authored-by: omri374 <3776619+omri374@users.noreply.github.com> Agent-Logs-Url: https://github.com/microsoft/presidio-research/sessions/8617ff64-a48a-4bc1-96cb-e83ac8d52192
Co-authored-by: omri374 <3776619+omri374@users.noreply.github.com> Agent-Logs-Url: https://github.com/microsoft/presidio-research/sessions/753f1074-cc0d-445e-b838-50b11eb55356
Co-authored-by: omri374 <3776619+omri374@users.noreply.github.com> Agent-Logs-Url: https://github.com/microsoft/presidio-research/sessions/6abd899f-aafd-4d21-bbde-c535cfc20d6e
@RonShakutai @negruber1 I created two ADRs that I think will greatly simplify the evaluation process. I would appreciate your honest feedback on whether that's the right path to take. We can also consider the alternative of writing things from scratch with a new interface (similar to other evaluation frameworks we know).
Added example mappings of raw entity labels to canonical entities in the EntityHierarchy, enhancing clarity on how different labels correspond to standardized entities.
3. **Make `model` optional in `BaseEvaluator`**: change `BaseEvaluator.__init__(self, model=None, ...)` so that `model` defaults to `None`, relying on the existing runtime check in `evaluate_all()` that raises a clear error when `model is None`.

4. **Update `evaluate_all()` to delegate to `predict_dataset` + `calculate_score_on_df`**: refactor `SpanEvaluator.evaluate_all()` and `TokenEvaluator.evaluate_all()` to call `self.model.predict_dataset(dataset)` and then pass the result to `calculate_score_on_df()`. This ensures a single code path for both old and new usage.
Isn't evaluate_all() only part of BaseEvaluator?
Yes, it is. Good catch. I'll update this comment, but the principle is the same: we keep the existing interface for backward compatibility, but it would essentially call the new flow.
So technically evaluate_all() returns the dataframe?
Not sure, maybe we can just deprecate it. WDYT?
If we call predict_dataset and then calculate_score_on_df, wouldn't evaluate_all be redundant? If we keep it for backward compatibility, I think we should leave it as is (return EvaluationResult)
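One way to picture the backward-compatibility option is to keep `evaluate_all()` as a thin wrapper over the new flow, so both entry points share a single code path. The sketch below is illustrative only: `StubModel`, `SketchEvaluator`, the simplified `Score` result, and the list-of-dicts "DataFrame" are hypothetical stand-ins, not the presidio_evaluator classes, and the return type of the real `evaluate_all()` is still the open question in this thread.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

Row = Dict[str, object]
Sample = List[Tuple[str, str]]  # (token, annotation) pairs; hypothetical input shape


@dataclass
class Score:
    """Hypothetical stand-in for whatever calculate_score_on_df() returns."""
    correct: int
    total: int


class StubModel:
    """Hypothetical model: tags title-cased tokens as PERSON, everything else as O."""

    def predict_dataset(self, dataset: List[Sample]) -> List[Row]:
        rows: List[Row] = []
        for sentence_id, sample in enumerate(dataset):
            for token, annotation in sample:
                prediction = "PERSON" if token.istitle() else "O"
                rows.append({"sentence_id": sentence_id, "token": token,
                             "annotation": annotation, "prediction": prediction})
        return rows


class SketchEvaluator:
    def __init__(self, model: Optional[StubModel] = None):
        self.model = model  # optional: scoring precomputed results needs no model

    def calculate_score_on_df(self, results_df: List[Row]) -> Score:
        correct = sum(1 for row in results_df if row["annotation"] == row["prediction"])
        return Score(correct=correct, total=len(results_df))

    def evaluate_all(self, dataset: List[Sample]) -> Score:
        # Legacy entry point: fail clearly without a model, otherwise delegate.
        if self.model is None:
            raise ValueError("evaluate_all() requires a model")
        return self.calculate_score_on_df(self.model.predict_dataset(dataset))


dataset = [[("Alice", "PERSON"), ("lives", "O"), ("in", "O"), ("Paris", "LOCATION")]]
legacy = SketchEvaluator(model=StubModel()).evaluate_all(dataset)
new_flow = SketchEvaluator().calculate_score_on_df(StubModel().predict_dataset(dataset))
```

In this toy setup `legacy` and `new_flow` are equal by construction, which is the property the single code path is meant to guarantee.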
1. **Add `BaseModel.predict_dataset()`**: implement the method as sketched above in `presidio_evaluator/models/base_model.py`. Add a unit test in `tests/` that verifies the 5-column schema and correct row count for a small synthetic dataset.

2. **Add `map_entities()` utility**: add the function (and `Dict` import) to `presidio_evaluator/evaluation/` (e.g., in a new `utils.py` or alongside `get_results_dataframe`). Add a unit test verifying that both `annotation` and `prediction` columns are remapped.
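To make the first task more concrete, here is a minimal sketch of what a `predict_dataset()` implementation and its schema check could look like. The five column names (`sentence_id`, `token`, `annotation`, `prediction`, `start_indices`), the `Sample` dataclass, and `UppercaseModel` are assumptions for illustration; the authoritative schema is the one sketched in the ADR itself.

```python
from dataclasses import dataclass
from typing import List

import pandas as pd

# Assumed column names; the ADR defines the real 5-column schema.
EXPECTED_COLUMNS = ["sentence_id", "token", "annotation", "prediction", "start_indices"]


@dataclass
class Sample:
    """Hypothetical, simplified stand-in for an input sample."""
    tokens: List[str]
    tags: List[str]
    start_indices: List[int]


class UppercaseModel:
    """Hypothetical model: tags all-uppercase tokens as ORG, everything else as O."""

    def predict(self, sample: Sample) -> List[str]:
        return ["ORG" if token.isupper() else "O" for token in sample.tokens]

    def predict_dataset(self, dataset: List[Sample]) -> pd.DataFrame:
        rows = []
        for sentence_id, sample in enumerate(dataset):
            predictions = self.predict(sample)
            # One row per token, pairing the gold tag with the predicted tag.
            for token, tag, pred, start in zip(
                sample.tokens, sample.tags, predictions, sample.start_indices
            ):
                rows.append((sentence_id, token, tag, pred, start))
        return pd.DataFrame(rows, columns=EXPECTED_COLUMNS)


dataset = [
    Sample(["IBM", "hired", "Dana"], ["ORG", "O", "PERSON"], [0, 4, 10]),
    Sample(["Hello", "NASA"], ["O", "ORG"], [0, 6]),
]
results_df = UppercaseModel().predict_dataset(dataset)
```

A unit test along the lines described in the task would then assert on `results_df.columns` and on `len(results_df)` (one row per token in the dataset).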
I think we should add the mapping to results dataframe. If we allow users to define their own mapping, I think that we need to make sure that both predicted and annotated entities are mapped to the same canonical mapping (if they really are a match).
Yes that's what I was thinking too. Essentially the only place where we change entities to their canonical form is this dataframe. This would simplify the flow today which does mapping in different parts of the flow.
In any case, the entity mapper generates one mapping for both the annotation and the prediction, so they would have to be mapped the same way.
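As a rough illustration of "one mapping for both columns", a `map_entities()` utility could look like the sketch below. The signature, the pass-through of unmapped labels, and the example mapping are assumptions, not the final design.

```python
from typing import Dict

import pandas as pd


def map_entities(results_df: pd.DataFrame, mapping: Dict[str, str]) -> pd.DataFrame:
    """Apply the same label mapping to both the annotation and prediction columns.

    Labels missing from the mapping (e.g. "O") are kept unchanged. The input
    frame is not modified; a remapped copy is returned.
    """
    mapped = results_df.copy()
    for column in ("annotation", "prediction"):
        mapped[column] = mapped[column].map(lambda label: mapping.get(label, label))
    return mapped


results_df = pd.DataFrame(
    {
        "token": ["Dana", "lives", "Main St."],
        "annotation": ["NAME", "O", "STREET_ADDRESS"],
        "prediction": ["PERSON", "O", "LOCATION"],
    }
)
# Hypothetical mapping that lifts the dataset's finer labels to the model's labels.
mapping = {"NAME": "PERSON", "STREET_ADDRESS": "LOCATION"}
mapped_df = map_entities(results_df, mapping)
```

Because a single dictionary drives both columns, an annotation and a prediction that refer to the same concept cannot end up under different canonical names.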
How does this look?
```python
# 1. Load dataset
dataset = InputSample.read_dataset_json("data/dataset.json")

# 2. Choose model and run predictions → get DataFrame directly
model = PresidioAnalyzerWrapper(analyzer_engine=AnalyzerEngine())
results_df = model.predict_dataset(dataset)  # NEW: returns the DataFrame directly

# 3. Map entities (transforms both predictions and annotations into canonical entities)
mapper = CanonicalMapper()

# 4. Map to hierarchy (PII, High level, canonical, specific) and evaluate
evaluator = SpanEvaluator()
results_per_hierarchy = []
for hierarchy in [1, 2, 3]:
    results_df_hierarchy = mapper.map_entities(results_df, hierarchy=hierarchy)
    # Collect one result per hierarchy level instead of overwriting the list
    results_per_hierarchy.append(
        evaluator.calculate_score_on_df(results_df=results_df_hierarchy)
    )

# 5. Analyze/plot (here: the first hierarchy level)
plotter = Plotter(results=results_per_hierarchy[0])
plotter.plot_scores()
```
Alternatively, one can just use the default (hierarchy=3) and run the experiment, which is simpler:
```python
# 1. Load dataset
dataset = InputSample.read_dataset_json("data/dataset.json")

# 2. Choose model and run predictions
model = PresidioAnalyzerWrapper(analyzer_engine=AnalyzerEngine())
results_df = model.predict_dataset(dataset)

# 3. Map entities (transforms both predictions and annotations into canonical entities)
mapper = CanonicalMapper()

# 4. Map to hierarchy (PII, High level, canonical, specific) and evaluate
evaluator = SpanEvaluator()
results_df_mapped = mapper.map_entities(results_df)
results = evaluator.calculate_score_on_df(results_df=results_df_mapped)

# 5. Analyze/plot
plotter = Plotter(results=results)
plotter.plot_scores()
```
Looks good to me :) BTW, what if a label is too general for the chosen canonical depth? Should we handle this case?
That's a good question! So if the user has a model with ["PERSON", "LOCATION"] and the dataset has ["STREET_ADDRESS", "NAME"], how should we map the two? PERSON and LOCATION are level 2, but we want to map to level 3. In this case, maybe if one of the entities is level 2, we should map everything to level 2? Or is this too naive?
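The "map everything to the coarsest level present" idea can be sketched as follows. The tiny `PARENT` hierarchy and the function names are hypothetical and only cover the labels from this example; the real hierarchy is the draft dictionary in ADR-002.

```python
from typing import Dict, Iterable

# Hypothetical 3-level hierarchy: child -> parent. "PII" is the root (level 1).
PARENT: Dict[str, str] = {
    "PERSON": "PII",
    "LOCATION": "PII",
    "NAME": "PERSON",
    "STREET_ADDRESS": "LOCATION",
}
KNOWN_LABELS = set(PARENT) | set(PARENT.values())


def level_of(label: str) -> int:
    """Depth of a label in the hierarchy; the root is level 1."""
    depth = 1
    while label in PARENT:
        label = PARENT[label]
        depth += 1
    return depth


def lift(label: str, target_level: int) -> str:
    """Walk up the hierarchy until the label is at or above target_level."""
    while level_of(label) > target_level:
        label = PARENT[label]
    return label


def coarsest_common_mapping(
    model_labels: Iterable[str], dataset_labels: Iterable[str]
) -> Dict[str, str]:
    """Map every label to the coarsest level that appears on either side."""
    labels = set(model_labels) | set(dataset_labels)
    unknown = labels - KNOWN_LABELS
    if unknown:
        raise ValueError(f"Labels not in hierarchy: {sorted(unknown)}")
    target_level = min(level_of(label) for label in labels)
    return {label: lift(label, target_level) for label in labels}


mapping = coarsest_common_mapping(
    model_labels=["PERSON", "LOCATION"],
    dataset_labels=["STREET_ADDRESS", "NAME"],
)
```

With these inputs the target level is 2, so `NAME` is lifted to `PERSON` and `STREET_ADDRESS` to `LOCATION`. The downside raised in this thread is visible here too: a single coarse label anywhere drags every branch down to that level.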
Alternatively, we can choose one level 3 entity that a level 2 entity would be mapped to (like "NAME" for "PERSON" or "ADDRESS" for "LOCATION").
In this case, where the model's and dataset's levels of depth are different, should both mappers be coordinated when deciding on the depth? Assuming we have a mapper for the model and another one for the dataset and there is some auto-downgrade done. Also, what if the model includes both level 2 and level 3 entities? Does it make sense to downgrade per category only where there's a depth mismatch? It might complicate things
So the proposed approach here is to have one mapper for everything. If there are both level 2 and level 3 entities, we would go with level 2. Another option is to change only the branch that has level 2 entities and keep everything else at level 3. So if the dataset has PERSON (level 2), ADDRESS (level 3) and some other level 3 entities, and the model has FIRST_NAME, PREFIX, LAST_NAME (level 3 under PERSON), then these would be mapped to PERSON, while the level 3 entities under other branches (like the LOCATION branch) would stay at level 3.
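The per-branch option could look roughly like this: a label is lifted only if one of its own ancestors appears among the labels in use, so a coarse label in one branch does not affect the others. The `PARENT` hierarchy and function names are hypothetical, chosen to mirror the PERSON/LOCATION example.

```python
from typing import Dict, Iterable, List

# Hypothetical hierarchy: child -> parent. "PII" is the root.
PARENT: Dict[str, str] = {
    "PERSON": "PII",
    "LOCATION": "PII",
    "FIRST_NAME": "PERSON",
    "LAST_NAME": "PERSON",
    "PREFIX": "PERSON",
    "ADDRESS": "LOCATION",
    "CITY": "LOCATION",
}


def ancestors(label: str) -> List[str]:
    """Ancestors of a label, nearest first, ending at the root."""
    chain = []
    while label in PARENT:
        label = PARENT[label]
        chain.append(label)
    return chain


def per_branch_mapping(
    model_labels: Iterable[str], dataset_labels: Iterable[str]
) -> Dict[str, str]:
    """Lift a label only when a coarser label from its own branch is in use."""
    present = set(model_labels) | set(dataset_labels)
    mapping = {}
    for label in present:
        target = label
        # Ancestors are visited bottom-up, so the last match is the coarsest one.
        for ancestor in ancestors(label):
            if ancestor in present:
                target = ancestor
        mapping[label] = target
    return mapping


mapping = per_branch_mapping(
    model_labels=["FIRST_NAME", "PREFIX", "LAST_NAME", "ADDRESS"],
    dataset_labels=["PERSON", "ADDRESS", "CITY"],
)
```

Here the three name-related labels collapse to `PERSON` because the dataset uses `PERSON` directly, while `ADDRESS` and `CITY` stay at level 3 since nothing in the `LOCATION` branch forces a downgrade.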
Sorry... this is becoming too complicated :)
Added details about EvaluationResult and Error Analysis.
Refactor evaluation pipeline to include entity mapping and scoring per hierarchy.
Updated the prediction method to return a DataFrame directly.
Added input format details for token comparisons and updated usage examples for CanonicalMapper.
Added example code for multi-hierarchical evaluations in the ADR document.
Two ADRs to simplify the evaluation process and support other applications that use it.