In our recent paper *Distillation Robustifies Unlearning*, we demonstrate that distilling a conventionally unlearned model into a randomly initialized model produces robust unlearning.
In this interactive demo, you can explore one of the three settings we studied: models trained on the four arithmetic operations (addition, subtraction, multiplication, division), with multiplication and division subsequently unlearned.
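To make the recipe concrete, here is a minimal PyTorch-style sketch of the two stages, assuming HuggingFace-style causal LMs and treating MaxEnt unlearning as pushing the next-token distribution toward uniform on the forget set. All names (models, loaders, hyperparameters) are placeholders, not the paper's actual code.

```python
import torch
import torch.nn.functional as F

def maxent_unlearn_step(model, batch, optimizer):
    """One MaxEnt unlearning step on forget-set prompts (multiplication/division):
    push the model's next-token distribution toward uniform."""
    logits = model(batch["input_ids"]).logits                   # (B, T, V)
    log_probs = F.log_softmax(logits, dim=-1)
    uniform = torch.full_like(log_probs, 1.0 / log_probs.size(-1))
    loss = F.kl_div(log_probs, uniform, reduction="batchmean")  # KL(uniform || model)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

def distill_step(student, teacher, batch, optimizer, temperature=1.0):
    """One distillation step: a randomly initialized student matches the
    unlearned teacher's output distribution on the broad training mix."""
    with torch.no_grad():
        teacher_probs = F.softmax(teacher(batch["input_ids"]).logits / temperature, dim=-1)
    student_log_probs = F.log_softmax(student(batch["input_ids"]).logits / temperature, dim=-1)
    loss = F.kl_div(student_log_probs, teacher_probs, reduction="batchmean")
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Stage 1: apply maxent_unlearn_step to the pretrained model on forget-set batches.
# Stage 2: initialize a fresh student and apply distill_step against the unlearned teacher.
```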
Try it yourself: when you enter multiplication and division problems, you'll see that after a relearning attack, the standard unlearned model regains these capabilities, while our distilled model remains robustly unlearned!
Extra: toggle the switch below the response box to use custom prompts.
| | Standard Unlearning (traditional approach, MaxEnt) | Unlearn-and-Distill (our robust approach) |
|---|---|---|
| **Initial Response** (after unlearning) | Nonsense on multiplication/division problems; high accuracy on addition/subtraction. | Nonsense on multiplication/division problems; high accuracy on addition/subtraction. |
| **After Relearning Attack** (50 steps on multiplication/division) | Regains high accuracy. | Answers with numbers, but accuracy stays below 10%. |
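The relearning attack referenced above is ordinary fine-tuning on the forgotten domain. A hedged sketch, assuming HuggingFace-style models and tokenizers; the operand range, data format, and optimizer settings here are illustrative guesses, not the demo's exact attack configuration:

```python
import random
import torch
import torch.nn.functional as F

def relearning_attack(model, tokenizer, optimizer, steps=50):
    """Fine-tune an unlearned model on multiplication/division examples and see
    whether the erased capability comes back."""
    model.train()
    for _ in range(steps):
        a, b = random.randint(2, 99), random.randint(2, 99)   # illustrative range
        text = f"{a} * {b} = {a * b}"                          # training-style format
        ids = tokenizer(text, return_tensors="pt").input_ids
        logits = model(input_ids=ids).logits                   # (1, T, V)
        # Standard next-token prediction loss (shift logits against targets).
        loss = F.cross_entropy(
            logits[:, :-1, :].reshape(-1, logits.size(-1)),
            ids[:, 1:].reshape(-1),
        )
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return model
```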
A few tips for getting good responses:

- For best results, use prompts similar to the training data; the models were trained on limited ranges for each operation.
- Include spaces before and after operators and the equals sign. Valid operators: `+`, `-`, `*`, `/`.
- Use a fill-in-the-answer format, not questions.
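As an example of prompts that match these tips, a small helper might format problems like this (the operator check is the only constraint shown; the operand ranges the demo models were trained on are not reproduced here):

```python
VALID_OPERATORS = {"+", "-", "*", "/"}

def make_prompt(a: int, b: int, op: str) -> str:
    """Build a fill-in-the-answer prompt such as '12 * 7 = ',
    with spaces around the operator and the equals sign."""
    if op not in VALID_OPERATORS:
        raise ValueError(f"operator must be one of {sorted(VALID_OPERATORS)}")
    return f"{a} {op} {b} = "

# Good:  make_prompt(12, 7, "*")  ->  "12 * 7 = "
# Avoid: question phrasing such as "What is 12 times 7?"
```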