Roadmap: close the remaining gaps #92

## Summary
Erasus already covers a wide span of unlearning methods. To become the state-of-the-art package for unlearning research, it needs a tighter combination of reproducibility, benchmark parity, scalable training support, and packaging ergonomics.

This issue tracks the highest-leverage work needed to make Erasus the default open-source package for unlearning experiments rather than just a broad collection of implementations.

## Why this matters
Researchers and practitioners evaluate unlearning frameworks on four things:
- breadth of methods
- credibility of benchmark results
- ease of reproducing real papers on real models
- reliability of the package in messy real environments

Erasus is already strong on breadth. The next gap is turning that breadth into reproducible, benchmark-backed, easy-to-run depth.

## Roadmap
### 1. Reproducible real-benchmark harnesses
- [ ] Standardize real benchmark entrypoints for TOFU, MUSE, WMDP, and lm-eval tasks behind one common CLI contract
- [ ] Add config presets for paper-faithful runs on GPT-2, Zephyr-7B, and at least one PEFT-based 7B path
- [ ] Save machine-readable result bundles with metrics, configs, seed, git SHA, model revision, and dataset revision
- [ ] Add a benchmark manifest schema so leaderboard outputs are reproducible and comparable across runs
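
One possible shape for the machine-readable result bundle described above, as a minimal sketch. The class name `ResultBundle`, the `schema_version` field, and the helper `current_git_sha` are hypothetical and not part of Erasus today; the remaining fields mirror the provenance items listed in this section.

```python
import json
import subprocess
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ResultBundle:
    """Hypothetical provenance record for a single benchmark run."""

    benchmark: str
    metrics: dict
    config: dict
    seed: int
    git_sha: str
    model_revision: str
    dataset_revision: str
    schema_version: str = "0.1"  # assumed versioning scheme

    def to_json(self) -> str:
        # sort_keys keeps the serialized form stable for identical content.
        return json.dumps(asdict(self), sort_keys=True, indent=2)

    @classmethod
    def from_json(cls, text: str) -> "ResultBundle":
        return cls(**json.loads(text))


def current_git_sha() -> str:
    """Return the SHA of HEAD, or "unknown" when git is unavailable."""
    try:
        out = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True, text=True, check=True,
        )
        return out.stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        return "unknown"
```

A bundle built this way round-trips through JSON, so two runs could be compared by diffing their serialized files. All values passed in by a caller would come from the actual run; nothing here computes metrics.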

### 2. Baseline parity with major unlearning papers
- [ ] Audit implemented strategies against the latest LLM unlearning papers and document missing training details or approximations
- [ ] Add paper-parity benchmark scripts for NPO, SimNPO, RMU, FLAT, UNDIAL, activation steering, DExperts, and delta-unlearning
- [ ] Publish baseline result tables in-repo for at least TOFU, WMDP, and post-unlearning capability tasks
- [ ] Add a clear status tag ("implemented", "approximated", or "paper-faithful") to each strategy's docs

### 3. Scalable training and inference support
- [ ] Add first-class PEFT support across unlearners with LoRA/QLoRA adapters
- [ ] Add `accelerate` integration for multi-GPU and gradient accumulation workflows
- [ ] Support 4-bit / 8-bit loading paths for both unlearning and post-unlearning evaluation
- [ ] Add checkpoint resume support for long-running benchmark jobs
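
For the resume item, one simple interpretation is job-level resumption: record which benchmark tasks have finished and skip them on restart. The sketch below assumes that interpretation and uses only the standard library; `run_with_resume` and its state-file layout are hypothetical, and model or optimizer checkpointing inside a single task is out of scope here.

```python
import json
import os
import tempfile
from pathlib import Path


def run_with_resume(tasks, run_task, state_path):
    """Run each named task once, skipping tasks already recorded as done.

    tasks: iterable of task names.
    run_task: callable taking a task name and returning a JSON-serializable result.
    state_path: JSON file that stores results of completed tasks.
    """
    state_path = Path(state_path)
    done = json.loads(state_path.read_text()) if state_path.exists() else {}
    for name in tasks:
        if name in done:
            continue  # finished in an earlier invocation
        done[name] = run_task(name)
        # Write to a temporary file and rename it, so an interrupted write
        # does not leave a half-written state file behind.
        fd, tmp = tempfile.mkstemp(dir=state_path.parent)
        with os.fdopen(fd, "w") as fh:
            json.dump(done, fh)
        os.replace(tmp, state_path)
    return done
```

Calling the function a second time with the same state file only runs tasks that are not yet recorded, which is the behaviour a long benchmark sweep needs after a crash or preemption.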

### 4. Stronger verification and privacy evaluation
- [ ] Expand prompt-extraction suites into reusable attack packs with jailbreak, multilingual, paraphrase, and long-context retrieval modes
- [ ] Add calibration and confidence reports to MIA outputs, not just scalar attack scores
- [ ] Add corpus-level memorization reports for forget sets with attribution to exact samples/documents
- [ ] Add a unified privacy report spanning MIA, memorization, extraction, relearning, and RAG leakage
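
As an illustration of what a calibration report could add beyond a scalar attack score, the sketch below bins predicted membership probabilities and compares each bin's mean confidence with its observed membership rate, summarizing the gap as expected calibration error (ECE). It assumes the attack emits probabilities in [0, 1]; the function name and report layout are hypothetical.

```python
def calibration_report(probs, is_member, n_bins=10):
    """Compare predicted membership probabilities with observed membership.

    probs: predicted probability of membership for each sample, in [0, 1].
    is_member: ground-truth membership flags (truthy means member).
    Returns the expected calibration error and per-bin statistics.
    """
    if len(probs) != len(is_member) or not probs:
        raise ValueError("probs and is_member must be non-empty and equal length")
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, is_member):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[idx].append((p, int(bool(y))))
    rows, ece = [], 0.0
    for i, items in enumerate(bins):
        if not items:
            continue
        mean_conf = sum(p for p, _ in items) / len(items)
        member_rate = sum(y for _, y in items) / len(items)
        # Weight each bin's gap by the fraction of samples it holds.
        ece += len(items) / len(probs) * abs(mean_conf - member_rate)
        rows.append({
            "bin": i,
            "count": len(items),
            "mean_confidence": mean_conf,
            "member_rate": member_rate,
        })
    return {"ece": ece, "bins": rows}
```

An attack that is confident but wrong yields a large ECE even when its scalar accuracy looks acceptable, which is the kind of signal a per-bin report makes visible.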

### 5. Dataset and deletion workflow quality
- [ ] Add canonical dataset wrappers for popular unlearning datasets with version pinning and preprocessing provenance
- [ ] Add machine-readable forget request manifests for sample-, user-, concept-, and document-level deletion settings
- [ ] Add validation utilities that detect overlap/leakage between forget, retain, and eval splits
- [ ] Add documentation for supported deletion granularities and expected benchmark mappings
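
A minimal sketch of the split-overlap check, assuming each split is a list of text samples. It only catches duplicates that are identical after case and whitespace normalization; paraphrases or near-duplicates would need a fuzzier method. The function names are hypothetical.

```python
import hashlib
from itertools import combinations


def _fingerprint(text):
    # Normalize case and whitespace so trivially reformatted copies match.
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def find_split_overlap(splits):
    """Report samples shared between any two splits.

    splits: dict mapping a split name to a list of text samples.
    Returns a dict keyed by (split_a, split_b) whose values are the samples
    of split_a that also occur, after normalization, in split_b.
    """
    prints = {
        name: {_fingerprint(t): t for t in texts}
        for name, texts in splits.items()
    }
    overlap = {}
    for a, b in combinations(prints, 2):
        shared = prints[a].keys() & prints[b].keys()
        if shared:
            overlap[(a, b)] = sorted(prints[a][h] for h in shared)
    return overlap
```

An empty result means no exact duplicates were found between forget, retain, and eval data under this normalization; it does not rule out semantic leakage.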

### 6. Package ergonomics and modularity
- [ ] Introduce optional dependency extras by surface area such as `llm`, `vision`, `audio`, `benchmarks`, and `ui`
- [ ] Make package-level imports consistently lazy so optional integrations do not make `import erasus` brittle
- [ ] Add a plugin/discovery mechanism for third-party strategies, selectors, metrics, and benchmark adapters
- [ ] Standardize structured result objects and serialization across unlearning, evaluation, verification, and benchmarks
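
One common way to make package-level imports lazy is a module-level `__getattr__` (PEP 562), which defers importing a submodule until one of its names is first accessed. The sketch below is an assumption about how this could be done, not a description of the current `erasus/__init__.py`; the helper name is hypothetical, and the demo uses the standard-library `json` module as a stand-in for an optional dependency.

```python
import importlib
import sys
import types


def make_lazy_getattr(module_name, lazy_attrs):
    """Build a module-level __getattr__ that imports targets on first access.

    lazy_attrs maps a public name to (module to import, attribute inside it).
    """
    def __getattr__(name):
        if name not in lazy_attrs:
            raise AttributeError(
                f"module {module_name!r} has no attribute {name!r}"
            )
        target_module, attr = lazy_attrs[name]
        try:
            value = getattr(importlib.import_module(target_module), attr)
        except ImportError as exc:
            raise ImportError(
                f"{name!r} needs an optional dependency that is not installed"
            ) from exc
        # Cache on the module so later lookups bypass __getattr__.
        setattr(sys.modules[module_name], name, value)
        return value

    return __getattr__


# Demo on a throwaway module; in a real package this assignment would sit
# in __init__.py as `__getattr__ = make_lazy_getattr(__name__, {...})`.
demo = types.ModuleType("lazy_demo")
sys.modules["lazy_demo"] = demo
demo.__getattr__ = make_lazy_getattr(
    "lazy_demo",
    {"dumps": ("json", "dumps"), "broken": ("no_such_optional_dep", "thing")},
)
print(demo.dumps({"ok": True}))
```

With this pattern, importing the package itself never touches the optional dependency; a missing extra only surfaces as an `ImportError` when the specific name is used.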

### 7. Quality gates and developer trust
- [ ] Re-enable the full main test suite in CI and keep it green with coverage enforcement
- [ ] Add smoke tests for package import surfaces under reduced optional-dependency environments
- [ ] Add benchmark regression tests that check output schema, not just execution
- [ ] Add docs pages mapping claims to tests, benchmarks, and implemented modules
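
A regression test that checks output schema, rather than only that a benchmark ran, could look roughly like the sketch below. The required fields follow the provenance items named in section 1; the exact field names, types, and the `schema_errors` helper are assumptions for illustration.

```python
import math

# Assumed schema: field name -> expected Python type after JSON decoding.
REQUIRED_FIELDS = {
    "benchmark": str,
    "metrics": dict,
    "config": dict,
    "seed": int,
    "git_sha": str,
    "model_revision": str,
    "dataset_revision": str,
}


def schema_errors(bundle):
    """Return a list of human-readable schema violations (empty if valid)."""
    errors = []
    for key, expected in REQUIRED_FIELDS.items():
        if key not in bundle:
            errors.append(f"missing field: {key}")
        elif not isinstance(bundle[key], expected):
            errors.append(
                f"{key}: expected {expected.__name__}, "
                f"got {type(bundle[key]).__name__}"
            )
    metrics = bundle.get("metrics")
    if isinstance(metrics, dict):
        for name, value in metrics.items():
            # Reject booleans, non-numbers, NaN, and infinities.
            if (isinstance(value, bool)
                    or not isinstance(value, (int, float))
                    or not math.isfinite(value)):
                errors.append(f"metrics.{name}: expected a finite number")
    return errors


def test_bundle_schema():
    # Placeholder values; a real test would load a bundle produced by a run.
    bundle = {
        "benchmark": "tofu", "metrics": {"placeholder_metric": 0.5},
        "config": {}, "seed": 0, "git_sha": "abc",
        "model_revision": "r1", "dataset_revision": "d1",
    }
    assert schema_errors(bundle) == []
```

Returning a list of violations instead of raising on the first one keeps CI output useful when several fields drift at once.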

## Suggested acceptance criteria
- A new user can reproduce at least one real run each of TOFU, MUSE, and WMDP from documented commands without editing source files
- Strategy docs clearly state whether each implementation is approximate or paper-faithful
- Result bundles can be regenerated from their recorded seed and include all provenance needed for comparison
- `import erasus` and core workflows succeed cleanly in minimal environments with optional features gated behind extras
- Verification outputs are consolidated into a single report that is suitable for papers and model-card publication

## Nice-to-have follow-ups
- [ ] Public benchmark artifact hosting for result bundles and plots
- [ ] Example notebooks for end-to-end LLM unlearning on real models
- [ ] Automatic model-card generation from benchmark + verification reports

