[Feature Request] RMU Unlearning #66

ruidazeng · 2025-03-03T02:44:08Z

Tasks

Feature request

RMU (Representation Misdirection for Unlearning) is a state-of-the-art unlearning method based on controlling model representations. According to the WMDP paper, RMU reduces model performance on WMDP while maintaining general capabilities in areas such as biology and computer science, suggesting that unlearning may be a concrete path towards reducing malicious use of LLMs.

The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning

Paper: https://arxiv.org/abs/2403.03218
Site: https://www.wmdp.ai/
Representation Engineering: https://www.ai-transparency.org/
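
For anyone picking this up, here is a rough sketch of the RMU objective as I understand it from the paper: a forget term pushes the updated model's activations on forget data toward a scaled random control vector, and a retain term keeps its activations on retain data close to the frozen model's. This is a toy version on plain Python lists, not the reference implementation. The function names, the `alpha` weight, and the coefficient used in the usage note are illustrative assumptions. A real version would read activations from a chosen layer of the model.

```python
import math
import random


def random_unit_vector(dim, seed=0):
    """Sample entries uniformly from [0, 1) and normalise to unit length."""
    rng = random.Random(seed)
    raw = [rng.random() for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in raw))
    return [x / norm for x in raw]


def mean_squared_distance(activations, targets):
    """Average over tokens of the squared L2 distance between activations."""
    total = 0.0
    for act, tgt in zip(activations, targets):
        total += sum((a - t) ** 2 for a, t in zip(act, tgt))
    return total / len(activations)


def rmu_loss(updated_forget, updated_retain, frozen_retain, control, alpha):
    """Toy RMU objective (names are hypothetical, not from any library).

    updated_forget: per-token activations of the updated model on forget data
    updated_retain: per-token activations of the updated model on retain data
    frozen_retain:  per-token activations of the frozen model on retain data
    control:        scaled random vector that forget activations are pushed toward
    alpha:          weight on the retain term
    """
    forget_targets = [control] * len(updated_forget)
    forget_term = mean_squared_distance(updated_forget, forget_targets)
    retain_term = mean_squared_distance(updated_retain, frozen_retain)
    return forget_term + alpha * retain_term
```

A control vector could be built as `control = [2.0 * x for x in random_unit_vector(dim)]`, where `2.0` is an arbitrary placeholder for the scaling coefficient and not a recommended setting.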

Motivation

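RMU is reported to perform well on the WMDP benchmark: per the paper, it lowers accuracy on WMDP's hazardous-knowledge questions while preserving general capabilities.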

ruidazeng · 2025-03-03T14:02:24Z

TIME article about RMU: https://time.com/6878893/ai-artificial-intelligence-dangerous-knowledge/
GitHub for WMDP/RMU: https://github.com/centerforaisafety/wmdp

ruidazeng · 2025-03-03T15:58:17Z

GitHub repo for representation engineering: https://github.com/andyzoujm/representation-engineering

Paper for Representation Engineering: https://arxiv.org/abs/2310.01405

Center for AI Safety blog post about Representation Engineering: https://www.safe.ai/blog/representation-engineering-a-new-way-of-understanding-models

Center for AI Safety video about RMU: https://www.youtube.com/watch?v=2U5NNiGC9yk

ruidazeng changed the title ~~Support for RMU Unlearning~~ [Feature Request] RMU Unlearning Mar 3, 2025

Dornavineeth added the "unlearning method" label (Request to include new unlearning method) Mar 3, 2025
