CLEAR learns interpretable regex-based extraction rules for Named Entity Recognition (NER) in German legal documents. It builds upon RuleChef, an LLM-powered rule synthesis framework with an iterative refinement loop to produce rules that are human-readable, auditable, and accurate.
- Synthesis — given a batch of annotated sentences, an LLM generates regex rules covering the observed entity patterns
- Refinement — rules are evaluated on a held-out set and iteratively improved based on false positives / false negatives
- Transfer (optional) — rules learned on one dataset can be used as basis rules for training on a second dataset
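The refinement step above depends on knowing which predictions are false positives and which gold entities were missed. The sketch below is a minimal illustration of that bookkeeping, not CLEAR's or RuleChef's actual implementation: the function name `evaluate_rules`, the `(label, pattern)` rule format, and the example sentences are all assumptions made for this example.

```python
import re


def evaluate_rules(rules, examples):
    """Compare regex predictions with gold spans.

    rules:    list of (label, pattern) pairs (hypothetical format)
    examples: list of (text, [(start, end, label), ...]) pairs
    Returns three sets of (sentence_idx, start, end, label):
    true positives, false positives, false negatives.
    """
    predicted, gold = set(), set()
    for idx, (text, spans) in enumerate(examples):
        gold.update((idx, start, end, label) for start, end, label in spans)
        for label, pattern in rules:
            for match in re.finditer(pattern, text):
                predicted.add((idx, match.start(), match.end(), label))
    return predicted & gold, predicted - gold, gold - predicted


# Illustrative rule and annotated sentences (made up for this sketch).
rules = [("ORG", r"\bBGH\b")]
examples = [
    ("Das Urteil des BGH vom 12. Mai 2020.", [(15, 18, "ORG")]),
    ("Der BFH entschied anders.", [(4, 7, "ORG")]),
    ("Die Abkürzung BGH steht im Glossar.", []),
]

tp, fp, fn = evaluate_rules(rules, examples)
print(len(tp), "correct,", len(fp), "false positive(s),", len(fn), "missed")
```

In a refinement loop, the false-positive and false-negative sets would be the feedback handed back to the LLM when it revises the rules.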
pip install -e .

Requires Python ≥ 3.8. Key dependencies: rulechef, openai, pydantic, stanza, spacy.

To download the LER dataset from Hugging Face and store it as JSON inside the data folder:
python clear_anonymization/preprocess/preprocess_ler.py \
--repository_id elenanereiss/german-ler \
--output_dir data/ler/ler_data.json

To store a dataset in CoNLL-U format inside the data folder:
python clear_anonymization/preprocess/preprocess_data.py \
--input_dir {datasetname}_TRAIN.zip \
--output_dir data/{datasetname}/{datasetname}_train.conllu \
--split train \
--verbose

The training set is further split into a train set and a dev set, which we use for testing during development. The existing validation set is held out for final evaluation.
python clear_anonymization/preprocess/create_train_dev_split.py \
--train-file data/{datasetname}/{datasetname}_train.conllu \
--output-dir /share/nverdha/data/{dataset_name}/ \
--dev-ratio 0.2 \
--seed 42 \
--stratified

First, serve a model with vLLM:
vllm serve Qwen/Qwen3.5-35B-A3B \
--port 8000 \
--max-model-len 64000 \
--reasoning-parser qwen3 \
--language-model-only \
--default-chat-template-kwargs '{"enable_thinking": false}'

Then run:
python benchmarks/benchmark.py \
--train-dir data/findok/data/{dataset_name}/{dataset_name}_train.conllu \
--test-dir data/findok/{dataset_name}/{dataset_name}_dev.conllu \
--dataset-name findok \
--classes organisation \
--model Qwen/Qwen3.5-35B-A3B \
--base-url http://localhost:8000/v1 \
--max-rules 30 \
--batch-size 30 \
--max-iterations 1 \
--output results_findok.json

Or use a config file (CLI flags override config values):
python benchmarks/benchmark.py --config benchmarks/config.yaml

| Argument | Default | Description |
|---|---|---|
| `--train-dir` | — | CoNLL-U training data |
| `--test-dir` | — | CoNLL-U dev / test data |
| `--transfer-train-dir` | — | CoNLL-U transfer train data |
| `--transfer-val-dir` | — | CoNLL-U transfer dev / test data; falls back to `--test-dir` if not provided |
| `--dataset-name` | `findok` | Dataset name |
| `--classes` | all | Comma-separated entity classes to learn |
| `--model` | `Qwen/Qwen3.5-35B-A3B` | vLLM model name |
| `--base-url` | `http://localhost:8000/v1` | OpenAI-compatible endpoint |
| `--max-rules` | 10 | Maximum number of rules to generate per LLM call |
| `--batch-size` | 20 | Training sentences per batch |
| `--max-iterations` | 3 | Refinement iterations after synthesis |
| `--sampling-strategy` | `balanced` | How to sample training examples |
| `--seed` | 42 | Random seed for reproducibility |
| `--rules-json` | — | Seed training with existing rules |
| `--skip-synthesis` | false | Skip synthesis, only run refinement |
| `--agentic` | false | Enable agentic LLM feedback loop |
| `--enable-critic` | false | Enable LLM-based rule critique |
| `--no-mdreport` | false | Skip generating the Markdown report |
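The precedence noted above (CLI flags override config values, which in turn override the defaults) can be illustrated with a small sketch. This is an assumption about how such a merge is commonly done, not the code in `benchmarks/benchmark.py`. The helper `resolve_settings` is hypothetical, and the config is passed as a plain dict instead of being read from YAML to keep the example dependency-free.

```python
import argparse

# Defaults for three of the documented flags.
DEFAULTS = {"max_rules": 10, "batch_size": 20, "max_iterations": 3}


def resolve_settings(config, argv):
    """Merge settings with precedence: defaults < config file < CLI flags."""
    parser = argparse.ArgumentParser()
    for name in DEFAULTS:
        # default=None distinguishes "flag not given" from an explicit value.
        parser.add_argument("--" + name.replace("_", "-"), type=int, default=None)
    given = vars(parser.parse_args(argv))
    cli = {key: value for key, value in given.items() if value is not None}
    return {**DEFAULTS, **config, **cli}


# The config sets two values; the CLI overrides one of them.
settings = resolve_settings({"max_rules": 30, "batch_size": 30}, ["--batch-size", "5"])
print(settings)  # {'max_rules': 30, 'batch_size': 5, 'max_iterations': 3}
```

Using `None` as the parser default is the key design choice: it lets the merge tell an omitted flag apart from a flag explicitly set to its default value.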
If a run is interrupted, resume from the last completed batch:
python benchmarks/benchmark.py \
--resume-from reports/findok/Qwen_Qwen3.5-35B-A3B/organisation/{folder_you_want_to_resume_experiment}/

The checkpoint file (checkpoint.json) is written after every batch and deleted on clean completion. The output directory already contains config.yaml with all original settings, so no other flags are needed.
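The checkpoint lifecycle described above (write after every batch, delete on clean completion, resume from the last completed batch) can be sketched as follows. This is an illustrative pattern under stated assumptions, not CLEAR's actual checkpoint code: the function `run_with_checkpoint` and the `next_batch` field are hypothetical, and the real checkpoint.json may store different or additional state.

```python
import json
import os
import tempfile


def run_with_checkpoint(batches, out_dir, process_batch):
    """Process batches in order, resuming from checkpoint.json if it exists."""
    ckpt_path = os.path.join(out_dir, "checkpoint.json")
    start = 0
    if os.path.exists(ckpt_path):
        with open(ckpt_path, encoding="utf-8") as f:
            start = json.load(f)["next_batch"]  # hypothetical field name
    for i in range(start, len(batches)):
        process_batch(batches[i])
        # Write to a temporary file, then rename, so an interruption
        # never leaves a half-written checkpoint behind.
        tmp_path = ckpt_path + ".tmp"
        with open(tmp_path, "w", encoding="utf-8") as f:
            json.dump({"next_batch": i + 1}, f)
        os.replace(tmp_path, ckpt_path)
    # Clean completion: remove the checkpoint.
    if os.path.exists(ckpt_path):
        os.remove(ckpt_path)


with tempfile.TemporaryDirectory() as out_dir:
    # Simulate an earlier run that was interrupted after two batches.
    with open(os.path.join(out_dir, "checkpoint.json"), "w", encoding="utf-8") as f:
        json.dump({"next_batch": 2}, f)
    processed = []
    run_with_checkpoint(["b0", "b1", "b2", "b3"], out_dir, processed.append)
    print(processed)  # ['b2', 'b3']
```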
Train on a source dataset, then continue on a target dataset seeded with the learned rules:
python benchmarks/benchmark.py \
--train-dir data/ler/split/train.conllu \
--test-dir data/ler/split/dev.conllu \
--dataset-name ler \
--transfer-train-dir data/findok/split/train.conllu \
--transfer-test-dir data/findok/split/dev.conllu \
--transfer-dataset-name findok \
--transfer-continuation synthesize_and_refine \
--model Qwen/Qwen3.5-35B-A3B

`--transfer-continuation` choices: `synthesize_and_refine` (default) or `refine_only` (adapt existing rules without synthesizing new ones).
Results are written to reports/{dataset}/{model}/{classes}/{date}/:
| File | Contents |
|---|---|
| `results_findok.json` | Metrics, per-class breakdown, learned rules |
| `results_findok.rules_report.md` | Human-readable rule evaluation report |
| `results_findok.training.jsonl` | Per-iteration training log |
| `config.yaml` | Exact config used for this run |
| `session_summary.json` | Full training history across all phases |
First, serve a model locally using vLLM:
python -m vllm.entrypoints.openai.api_server --model google/gemma-3-27b-it --host 0.0.0.0 --port 8000

Then run extraction:
from clear_anonymization.extractors import factory
# prompt_path must be passed as a string
extractor = factory.make_extractor("llm", model="google/gemma-3-27b-it", prompt_path="clear_anonymization/prompts/ner_task_2.txt")
extractor.predict("Frau Müller arbeitet beim Bundesgericht.")
[{'start': 0, 'end': 11, 'text': 'Frau Müller', 'entity': 'PERS'}, {'start': 26, 'end': 39, 'text': 'Bundesgericht', 'entity': 'ORG'}]

License: Apache 2.0
