Erasus already covers a wide span of unlearning methods, but to be the clear state-of-the-art package for unlearning research it needs a tighter combination of reproducibility, benchmark parity, scalable training support, and stronger packaging ergonomics.
This issue tracks the highest-leverage work needed to make Erasus the default open-source package for unlearning experiments rather than just a broad collection of implementations.
Why this matters
Researchers and practitioners evaluate unlearning frameworks on four things:
breadth of methods
credibility of benchmark results
ease of reproducing real papers on real models
reliability of the package in messy real environments
Erasus is already strong on breadth. The next gap is turning that breadth into reproducible, benchmark-backed, easy-to-run depth.
Roadmap
1. Reproducible real-benchmark harnesses
Standardize real benchmark entrypoints for TOFU, MUSE, WMDP, and lm-eval tasks behind one common CLI contract
Add config presets for paper-faithful runs on GPT-2, Zephyr-7B, and at least one PEFT-based 7B path
Save machine-readable result bundles with metrics, configs, seed, git SHA, model revision, and dataset revision
Add a benchmark manifest schema so leaderboard outputs are reproducible and comparable across runs
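As a rough illustration of the result-bundle item above, the sketch below shows one possible shape for a machine-readable bundle. The class name `ResultBundle`, its field names, and the `schema_version` value are assumptions made for this example; they are not an existing Erasus API.

```python
import json
import subprocess
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, Optional


def current_git_sha() -> Optional[str]:
    """Return the current commit SHA, or None outside a git checkout."""
    try:
        out = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True, text=True, check=True,
        )
        return out.stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        return None


@dataclass
class ResultBundle:
    """Hypothetical bundle holding metrics plus the provenance listed above."""
    benchmark: str
    strategy: str
    seed: int
    model_revision: str
    dataset_revision: str
    config: Dict[str, Any]
    metrics: Dict[str, float]
    git_sha: Optional[str] = field(default_factory=current_git_sha)
    schema_version: str = "0.1"  # assumed versioning scheme

    def to_json(self) -> str:
        # sort_keys keeps the output stable, so bundles diff cleanly across runs
        return json.dumps(asdict(self), indent=2, sort_keys=True)

    @classmethod
    def from_json(cls, text: str) -> "ResultBundle":
        return cls(**json.loads(text))
```

Keeping the bundle a plain dataclass means it round-trips through JSON without custom encoders, which is one way to make leaderboard outputs comparable across runs.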
2. Baseline parity with major unlearning papers
Audit implemented strategies against the latest LLM unlearning papers and document missing training details or approximations
Add paper-parity benchmark scripts for NPO, SimNPO, RMU, FLAT, UNDIAL, activation steering, DExperts, and delta-unlearning
Publish baseline result tables in-repo for at least TOFU, WMDP, and post-unlearning capability tasks
Add a clear "implemented", "approximated", and "paper-faithful" status tag in strategy docs
3. Scalable training and inference support
Add first-class PEFT support across unlearners with LoRA/QLoRA adapters
Add `accelerate` integration for multi-GPU and gradient accumulation workflows
Support 4-bit / 8-bit loading paths for both unlearning and post-unlearning evaluation
Add checkpoint resume support for long-running benchmark jobs
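The resume item above could be approached at several levels. The sketch below covers only the simplest one, task-level resume for a benchmark job, and uses the standard library alone. The function name `run_with_resume` and the JSON state-file layout are assumptions for illustration; resuming inside a training run (optimizer and scheduler state) would need more than this.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Callable, Dict, List


def run_with_resume(
    tasks: List[str],
    run_task: Callable[[str], Dict[str, float]],
    state_path: Path,
) -> Dict[str, Dict[str, float]]:
    """Run each task once, persisting results so an interrupted job can resume."""
    results: Dict[str, Dict[str, float]] = {}
    if state_path.exists():
        results = json.loads(state_path.read_text())

    for task in tasks:
        if task in results:
            continue  # already finished in a previous run
        results[task] = run_task(task)
        # Write to a temp file and rename it into place, so a crash
        # never leaves a half-written state file behind.
        fd, tmp = tempfile.mkstemp(dir=state_path.parent, suffix=".tmp")
        with os.fdopen(fd, "w") as fh:
            json.dump(results, fh)
        os.replace(tmp, state_path)
    return results
```

Calling the function a second time with the same `state_path` skips every task that already has a recorded result.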
4. Stronger verification and privacy evaluation
Expand prompt-extraction suites into reusable attack packs with jailbreak, multilingual, paraphrase, and long-context retrieval modes
Add calibration and confidence reports to MIA outputs, not just scalar attack scores
Add corpus-level memorization reports for forget sets with attribution to exact samples/documents
Add a unified privacy report spanning MIA, memorization, extraction, relearning, and RAG leakage
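To make the calibration item above concrete, here is a minimal sketch of a binned calibration error for membership-inference outputs. It assumes the attack emits a membership probability in [0, 1] per sample and that ground-truth labels are 1 for members and 0 for non-members; the function name and the choice of equal-width bins are illustrative, not an existing Erasus interface.

```python
from typing import List, Sequence


def expected_calibration_error(
    scores: Sequence[float], labels: Sequence[int], n_bins: int = 10
) -> float:
    """Bin-size-weighted mean of |mean score - member rate| over equal-width bins."""
    if len(scores) != len(labels) or len(scores) == 0:
        raise ValueError("scores and labels must be non-empty and equal length")

    bins: List[List[int]] = [[] for _ in range(n_bins)]
    for i, s in enumerate(scores):
        # A score of exactly 1.0 falls into the last bin.
        idx = min(int(s * n_bins), n_bins - 1)
        bins[idx].append(i)

    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_score = sum(scores[i] for i in members) / len(members)
        member_rate = sum(labels[i] for i in members) / len(members)
        ece += (len(members) / len(scores)) * abs(mean_score - member_rate)
    return ece
```

A value near 0 means the attack's stated confidence tracks the observed member rate; a scalar attack score alone does not reveal this.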
5. Dataset and deletion workflow quality
Add canonical dataset wrappers for popular unlearning datasets with version pinning and preprocessing provenance
Add machine-readable forget request manifests for sample-, user-, concept-, and document-level deletion settings
Add validation utilities that detect overlap/leakage between forget, retain, and eval splits
Add documentation for supported deletion granularities and expected benchmark mappings
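The overlap check described above might start as simply as the following sketch, which fingerprints normalized text and counts exact matches shared between splits. The function names and the normalization (lowercasing and whitespace collapsing) are assumptions for this example; near-duplicate or paraphrase detection would need a different technique.

```python
import hashlib
from typing import Dict, Iterable, Set, Tuple


def _fingerprint(text: str) -> str:
    """Hash of the text after lowercasing and collapsing whitespace."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def find_split_overlap(
    splits: Dict[str, Iterable[str]]
) -> Dict[Tuple[str, str], int]:
    """Return the number of shared samples for every pair of splits that overlap."""
    prints: Dict[str, Set[str]] = {
        name: {_fingerprint(t) for t in texts} for name, texts in splits.items()
    }
    names = sorted(prints)
    overlap: Dict[Tuple[str, str], int] = {}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            shared = len(prints[a] & prints[b])
            if shared:
                overlap[(a, b)] = shared
    return overlap
```

An empty result means no exact (post-normalization) duplicates were found between any two splits.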
6. Package ergonomics and modularity
Introduce optional dependency extras by surface area such as `llm`, `vision`, `audio`, `benchmarks`, and `ui`
Make package-level imports consistently lazy so optional integrations do not make `import erasus` brittle
Add a plugin/discovery mechanism for third-party strategies, selectors, metrics, and benchmark adapters
Standardize structured result objects and serialization across unlearning, evaluation, verification, and benchmarks
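One way to approach the lazy-import item above is a module-level `__getattr__` (PEP 562), so that a submodule is imported only when first accessed. The helper name `make_lazy_getattr` and the example mapping are hypothetical and not part of Erasus today.

```python
import importlib
from typing import Any, Callable, Dict


def make_lazy_getattr(
    package: str, submodules: Dict[str, str]
) -> Callable[[str], Any]:
    """Build a PEP 562 module-level __getattr__ that imports submodules on demand."""

    def __getattr__(name: str) -> Any:
        if name not in submodules:
            raise AttributeError(f"module {package!r} has no attribute {name!r}")
        try:
            return importlib.import_module(submodules[name])
        except ImportError as exc:
            # Surface a clearer message when an optional dependency is absent.
            raise ImportError(
                f"{package}.{name} needs an optional dependency; "
                "try installing the matching extra"
            ) from exc

    return __getattr__


# Hypothetical use inside a package's __init__.py:
# __getattr__ = make_lazy_getattr(__name__, {"vision": "erasus.vision"})
```

With this pattern the top-level import stays cheap, and a missing optional dependency fails only when the corresponding attribute is actually touched.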
7. Quality gates and developer trust
Re-enable the full main test suite in CI and keep it green with coverage enforcement
Add smoke tests for package import surfaces under reduced optional-dependency environments
Add benchmark regression tests that check output schema, not just execution
Add docs pages mapping claims to tests, benchmarks, and implemented modules
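A schema-level regression check, as opposed to one that only confirms a benchmark ran, could look roughly like the sketch below. The required field names mirror the provenance items listed in the roadmap, but the exact keys and types in `REQUIRED_FIELDS` are assumptions for illustration.

```python
from typing import Any, Dict, List, Mapping

# Assumed field names and types; adjust to whatever schema is adopted.
REQUIRED_FIELDS: Dict[str, type] = {
    "metrics": dict,
    "config": dict,
    "seed": int,
    "git_sha": str,
    "model_revision": str,
    "dataset_revision": str,
}


def schema_errors(bundle: Mapping[str, Any]) -> List[str]:
    """Return human-readable schema problems; an empty list means the bundle passes."""
    errors: List[str] = []
    for key, expected in REQUIRED_FIELDS.items():
        if key not in bundle:
            errors.append(f"missing field: {key}")
        elif not isinstance(bundle[key], expected):
            errors.append(
                f"{key}: expected {expected.__name__}, "
                f"got {type(bundle[key]).__name__}"
            )
    return errors
```

In a test suite, asserting `schema_errors(output) == []` on each benchmark's output would catch silent changes to the result format.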
Suggested acceptance criteria
A new user can reproduce at least one real TOFU, MUSE, and WMDP run from documented commands without editing source files
Strategy docs clearly state whether each implementation is approximate or paper-faithful
Result bundles are reproducible for a fixed seed and include all provenance needed for comparison across runs
`import erasus` and core workflows succeed cleanly in minimal environments with optional features gated behind extras
Verification outputs are consolidated into a single report that is suitable for papers and model-card publication
Nice-to-have follow-ups
Public benchmark artifact hosting for result bundles and plots
Example notebooks for end-to-end LLM unlearning on real models
Automatic model-card generation from benchmark + verification reports