CLEAR-anonymization

CLEAR: Comprehensible Learning for Entity Anonymization and Recognition

CLEAR learns interpretable, regex-based extraction rules for Named Entity Recognition (NER) in German legal documents. It builds on RuleChef, an LLM-powered rule synthesis framework with an iterative refinement loop, to produce rules that are human-readable, auditable, and accurate.

How it works

Synthesis — given a batch of annotated sentences, an LLM generates regex rules covering the observed entity patterns
Refinement — rules are evaluated on a held-out set and iteratively improved based on false positives / false negatives
Transfer (optional) — rules learned on one dataset can be used as basis rules for training on a second dataset
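
The refinement step can be pictured with a minimal sketch. It is not RuleChef's or CLEAR's actual implementation: the names `apply_rules` and `evaluate`, the rule format (a mapping from label to regex strings), and the example rule are all assumptions made for illustration. It applies regex rules to held-out sentences and collects the false positives and false negatives that a refinement round would feed back to the LLM.

```python
import re
from typing import Dict, List, Set, Tuple

Span = Tuple[int, int, str]  # (start offset, end offset, entity label)


def apply_rules(rules: Dict[str, List[str]], text: str) -> Set[Span]:
    """Run every regex of every label over the text and collect matched spans."""
    spans: Set[Span] = set()
    for label, patterns in rules.items():
        for pattern in patterns:
            for match in re.finditer(pattern, text):
                spans.add((match.start(), match.end(), label))
    return spans


def evaluate(rules: Dict[str, List[str]], examples: List[Tuple[str, Set[Span]]]) -> dict:
    """Score rules on held-out examples and keep the errors for refinement."""
    tp, false_pos, false_neg = 0, [], []
    for text, gold in examples:
        predicted = apply_rules(rules, text)
        tp += len(predicted & gold)
        false_pos.extend((text, span) for span in sorted(predicted - gold))
        false_neg.extend((text, span) for span in sorted(gold - predicted))
    precision = tp / (tp + len(false_pos)) if tp + len(false_pos) else 0.0
    recall = tp / (tp + len(false_neg)) if tp + len(false_neg) else 0.0
    return {"precision": precision, "recall": recall,
            "false_positives": false_pos, "false_negatives": false_neg}


# Hypothetical rule set and held-out examples, for illustration only
rules = {"PERS": [r"(?:Herr|Frau) [A-ZÄÖÜ][a-zäöüß]+"]}
held_out = [
    ("Frau Müller arbeitet beim Bundesgericht.", {(0, 11, "PERS"), (26, 39, "ORG")}),
    ("Herr Schmidt klagt.", {(0, 12, "PERS")}),
]
report = evaluate(rules, held_out)
print(report["precision"], report["recall"])
print(report["false_negatives"])
```

In this toy run the missed ORG span shows up as a false negative, which is the kind of signal a refinement iteration would use to propose a new rule.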

Installation

pip install -e .

Requires Python ≥ 3.8. Key dependencies: rulechef, openai, pydantic, stanza, spacy.

Data preparation

LER Dataset

To download the LER dataset from Hugging Face and store it as JSON inside the data folder:

python clear_anonymization/preprocess/preprocess_ler.py \
  --repository_id elenanereiss/german-ler \
  --output_dir data/ler/ler_data.json

Other Datasets (Musterfall, FinDok)

To store a dataset in CoNLL-U format inside the data folder:

python clear_anonymization/preprocess/preprocess_data.py \
  --input_dir {datasetname}_TRAIN.zip \
  --output_dir data/{datasetname}/{datasetname}_train.conllu \
  --split train \
  --verbose

The training set is then split further into a train and a dev set, which are used during development. The existing validation set is held out for final evaluation.

python clear_anonymization/preprocess/create_train_dev_split.py \
  --train-file data/{datasetname}/{datasetname}_train.conllu \
  --output-dir /share/nverdha/data/{datasetname}/ \
  --dev-ratio 0.2 \
  --seed 42 \
  --stratified
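
What a seeded, stratified split does can be sketched as follows. This is an illustration only and not the code in `create_train_dev_split.py`: the function name `stratified_split`, the sentence representation (a list of `(token, tag)` pairs with BIO tags), and the choice to stratify by the set of entity classes in each sentence are all assumptions.

```python
import random
from collections import defaultdict
from typing import List, Tuple

Sentence = List[Tuple[str, str]]  # (token, BIO tag) pairs


def stratified_split(sentences: List[Sentence], dev_ratio: float = 0.2, seed: int = 42):
    """Split sentences so each entity-class combination keeps roughly dev_ratio in dev."""
    rng = random.Random(seed)  # local RNG keeps the split reproducible
    groups = defaultdict(list)
    for sentence in sentences:
        # Group by the set of entity classes in the sentence ("B-PER" -> "PER")
        labels = sorted({tag.split("-", 1)[-1] for _, tag in sentence if tag != "O"})
        groups[tuple(labels)].append(sentence)

    train, dev = [], []
    for key in sorted(groups):  # fixed iteration order for determinism
        bucket = groups[key]
        rng.shuffle(bucket)
        n_dev = round(len(bucket) * dev_ratio)
        dev.extend(bucket[:n_dev])
        train.extend(bucket[n_dev:])
    return train, dev


# Toy data: ten sentences with a PER entity, ten without any entity
sentences = [[("Anna", "B-PER")] for _ in range(10)] + [[("hallo", "O")] for _ in range(10)]
train, dev = stratified_split(sentences, dev_ratio=0.2, seed=42)
print(len(train), len(dev))
```

Splitting per group, instead of shuffling the whole corpus once, keeps rare entity classes represented in both parts.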

Running the benchmark

First, serve a model with vLLM:

vllm serve Qwen/Qwen3.5-35B-A3B \
  --port 8000 \
  --max-model-len 64000 \
  --reasoning-parser qwen3 \
  --language-model-only \
  --default-chat-template-kwargs '{"enable_thinking": false}'

Then run:

python benchmarks/benchmark.py \
  --train-dir data/findok/data/{dataset_name}/{dataset_name}_train.conllu \
  --test-dir data/findok/{dataset_name}/{dataset_name}_dev.conllu \
  --dataset-name findok \
  --classes organisation \
  --model Qwen/Qwen3.5-35B-A3B \
  --base-url http://localhost:8000/v1 \
  --max-rules 30 \
  --batch-size 30 \
  --max-iterations 1 \
  --output results_findok.json

Or use a config file (CLI flags override config values):

python benchmarks/benchmark.py --config benchmarks/config.yaml
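
The precedence rule (CLI flags override config values) can be sketched as below. This is an assumed pattern, not the code in `benchmarks/benchmark.py`: the helper `merge_config` is hypothetical, only three of the flags are declared, and the config keys are assumed to mirror the flag names with underscores.

```python
import argparse
from typing import List


def merge_config(config: dict, argv: List[str]) -> dict:
    """Return config values, overridden by any flag given explicitly on the CLI."""
    parser = argparse.ArgumentParser()
    # default=None distinguishes "flag not given" from an explicit value
    parser.add_argument("--max-rules", type=int, default=None)
    parser.add_argument("--batch-size", type=int, default=None)
    parser.add_argument("--model", default=None)
    cli_values = vars(parser.parse_args(argv))

    merged = dict(config)  # start from the config file values
    merged.update({key: value for key, value in cli_values.items() if value is not None})
    return merged


# Hypothetical config contents; only --max-rules is passed on the command line
config = {"max_rules": 10, "batch_size": 20, "model": "Qwen/Qwen3.5-35B-A3B"}
merged = merge_config(config, ["--max-rules", "30"])
print(merged)
```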

Key arguments

Argument	Default	Description
`--train-dir`	—	CoNLL-U training data
`--test-dir`	—	CoNLL-U dev / test data
`--transfer-train-dir`	—	CoNLL-U transfer train data
`--transfer-val-dir`	—	CoNLL-U transfer dev / test data; falls back to `--test-dir` if not provided
`--dataset-name`	`findok`	Dataset name
`--classes`	all	Comma-separated entity classes to learn
`--model`	`Qwen/Qwen3.5-35B-A3B`	vLLM model name
`--base-url`	`http://localhost:8000/v1`	OpenAI-compatible endpoint
`--max-rules`	10	Maximum number of rules to generate per LLM call
`--batch-size`	20	Training sentences per batch
`--max-iterations`	3	Refinement iterations after synthesis
`--sampling-strategy`	`balanced`	How to sample training examples
`--seed`	42	Random seed for reproducibility
`--rules-json`	—	Seed training with existing rules
`--skip-synthesis`	false	Skip synthesis, only run refinement
`--agentic`	false	Enable agentic LLM feedback loop
`--enable-critic`	false	Enable LLM-based rule critique
`--no-mdreport`	false	Skip generating the Markdown report

Resuming after a crash

If a run is interrupted, resume from the last completed batch:

python benchmarks/benchmark.py \
  --resume-from reports/findok/Qwen_Qwen3.5-35B-A3B/organisation/{folder_you_want_to_resume_experiment}/

The checkpoint file (checkpoint.json) is written after every batch and deleted on clean completion. The output directory already contains config.yaml with all original settings, so no other flags are needed.
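
The checkpoint lifecycle described above (write after every batch, delete on clean completion) can be sketched like this. The function `run_batches` and the checkpoint fields `next_batch` and `results` are hypothetical; the real `checkpoint.json` layout is not documented here.

```python
import json
from pathlib import Path
from typing import Callable, List


def run_batches(batches: List[str], out_dir: str, process: Callable[[str], str]) -> List[str]:
    """Process batches in order, resuming from checkpoint.json if one exists."""
    checkpoint = Path(out_dir) / "checkpoint.json"
    state = {"next_batch": 0, "results": []}
    if checkpoint.exists():
        # A previous run was interrupted: continue after the last completed batch
        state = json.loads(checkpoint.read_text())

    for index in range(state["next_batch"], len(batches)):
        state["results"].append(process(batches[index]))
        state["next_batch"] = index + 1
        checkpoint.write_text(json.dumps(state))  # written after every batch

    checkpoint.unlink(missing_ok=True)  # deleted on clean completion
    return state["results"]
```

Calling `run_batches` again with the same output directory after a crash skips the batches already recorded in the checkpoint.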

Transfer learning

Train on a source dataset, then continue on a target dataset seeded with the learned rules:

python benchmarks/benchmark.py \
  --train-dir data/ler/split/train.conllu \
  --test-dir data/ler/split/dev.conllu \
  --dataset-name ler \
  --transfer-train-dir data/findok/split/train.conllu \
  --transfer-test-dir data/findok/split/dev.conllu \
  --transfer-dataset-name findok \
  --transfer-continuation synthesize_and_refine \
  --model Qwen/Qwen3.5-35B-A3B

--transfer-continuation choices: synthesize_and_refine (default) or refine_only (adapt existing rules without synthesizing new ones).

Outputs

Results are written to reports/{dataset}/{model}/{classes}/{date}/:

File	Contents
`results_findok.json`	Metrics, per-class breakdown, learned rules
`results_findok.rules_report.md`	Human-readable rule evaluation report
`results_findok.training.jsonl`	Per-iteration training log
`config.yaml`	Exact config used for this run
`session_summary.json`	Full training history across all phases

LLM Extractor

First, serve a model locally using vLLM:

python -m vllm.entrypoints.openai.api_server --model google/gemma-3-27b-it --host 0.0.0.0 --port 8000

Then run extraction:

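from clear_anonymization.extractors import factory

# Build an LLM-backed extractor from a model name and a prompt file
extractor = factory.make_extractor("llm", model="google/gemma-3-27b-it", prompt_path="clear_anonymization/prompts/ner_task_2.txt")
# predict returns a list of entity spans with character offsets
extractor.predict("Frau Müller arbeitet beim Bundesgericht.")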

[{'start': 0, 'end': 11, 'text': 'Frau Müller', 'entity': 'PERS'}, {'start': 26, 'end': 39, 'text': 'Bundesgericht', 'entity': 'ORG'}]
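
Spans in this format are enough to mask the text. The sketch below is not part of the package: the helper `anonymize` and the `[LABEL]` placeholder format are assumptions, and it expects non-overlapping spans.

```python
from typing import Dict, List


def anonymize(text: str, entities: List[Dict]) -> str:
    """Replace each entity span with a placeholder built from its label."""
    # Replace from right to left so earlier character offsets stay valid
    for entity in sorted(entities, key=lambda e: e["start"], reverse=True):
        placeholder = f"[{entity['entity']}]"
        text = text[:entity["start"]] + placeholder + text[entity["end"]:]
    return text


entities = [
    {"start": 0, "end": 11, "text": "Frau Müller", "entity": "PERS"},
    {"start": 26, "end": 39, "text": "Bundesgericht", "entity": "ORG"},
]
masked = anonymize("Frau Müller arbeitet beim Bundesgericht.", entities)
print(masked)  # [PERS] arbeitet beim [ORG].
```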

License

Apache 2.0
