In our recent paper *Distillation Robustifies Unlearning*, we demonstrate that distilling a conventionally unlearned model into a randomly initialized model produces robust unlearning.
In this interactive demo, you can explore one of the three settings we studied: models trained on the four arithmetic operations (addition, subtraction, multiplication, division), with multiplication and division subsequently unlearned.
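To make the recipe concrete, here is a minimal PyTorch-style sketch of the two stages, assuming HuggingFace-style causal LMs and treating MaxEnt unlearning as pushing the next-token distribution toward uniform on the forget set. All names (models, loaders, hyperparameters) are placeholders, not the paper's actual code.

```python
import torch
import torch.nn.functional as F

def maxent_unlearn_step(model, batch, optimizer):
    """One MaxEnt unlearning step on forget-set prompts (multiplication/division):
    push the model's next-token distribution toward uniform."""
    logits = model(batch["input_ids"]).logits                   # (B, T, V)
    log_probs = F.log_softmax(logits, dim=-1)
    uniform = torch.full_like(log_probs, 1.0 / log_probs.size(-1))
    loss = F.kl_div(log_probs, uniform, reduction="batchmean")  # KL(uniform || model)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

def distill_step(student, teacher, batch, optimizer, temperature=1.0):
    """One distillation step: a randomly initialized student matches the
    unlearned teacher's output distribution on the broad training mix."""
    with torch.no_grad():
        teacher_probs = F.softmax(teacher(batch["input_ids"]).logits / temperature, dim=-1)
    student_log_probs = F.log_softmax(student(batch["input_ids"]).logits / temperature, dim=-1)
    loss = F.kl_div(student_log_probs, teacher_probs, reduction="batchmean")
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Stage 1: apply maxent_unlearn_step to the pretrained model on forget-set batches.
# Stage 2: initialize a fresh student and apply distill_step against the unlearned teacher.
```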
Try it yourself: when you enter multiplication and division problems, you'll see that after a relearning attack, the standard unlearned model regains these capabilities, while our distilled model remains robustly unlearned!
Extra: toggle the switch below the response box to use custom prompts.
| | Standard Unlearning (traditional approach, MaxEnt) | Unlearn-and-Distill (our robust approach) |
|---|---|---|
| **Initial Response** (after unlearning) | Nonsense on multiplication/division problems; high accuracy on addition/subtraction. | Nonsense on multiplication/division problems; high accuracy on addition/subtraction. |
| **After Relearning Attack** (50 steps on multiplication/division) | Regains high accuracy. | Answers with numbers, but accuracy stays below 10%. |
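The relearning attack referenced above is ordinary fine-tuning on the forgotten domain. A hedged sketch, assuming HuggingFace-style models and tokenizers; the operand range, data format, and optimizer settings here are illustrative guesses, not the demo's exact attack configuration:

```python
import random
import torch
import torch.nn.functional as F

def relearning_attack(model, tokenizer, optimizer, steps=50):
    """Fine-tune an unlearned model on multiplication/division examples and see
    whether the erased capability comes back."""
    model.train()
    for _ in range(steps):
        a, b = random.randint(2, 99), random.randint(2, 99)   # illustrative range
        text = f"{a} * {b} = {a * b}"                          # training-style format
        ids = tokenizer(text, return_tensors="pt").input_ids
        logits = model(input_ids=ids).logits                   # (1, T, V)
        # Standard next-token prediction loss (shift logits against targets).
        loss = F.cross_entropy(
            logits[:, :-1, :].reshape(-1, logits.size(-1)),
            ids[:, 1:].reshape(-1),
        )
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return model
```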
A few tips for getting good responses:

- For best results, use prompts similar to the training data; the models were trained on limited ranges for each operation.
- Include spaces before and after operators and the equals sign. Valid operators: `+`, `-`, `*`, `/`.
- Use a fill-in-the-answer format, not questions.
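As an example of prompts that match these tips, a small helper might format problems like this (the operator check is the only constraint shown; the operand ranges the demo models were trained on are not reproduced here):

```python
VALID_OPERATORS = {"+", "-", "*", "/"}

def make_prompt(a: int, b: int, op: str) -> str:
    """Build a fill-in-the-answer prompt such as '12 * 7 = ',
    with spaces around the operator and the equals sign."""
    if op not in VALID_OPERATORS:
        raise ValueError(f"operator must be one of {sorted(VALID_OPERATORS)}")
    return f"{a} {op} {b} = "

# Good:  make_prompt(12, 7, "*")  ->  "12 * 7 = "
# Avoid: question phrasing such as "What is 12 times 7?"
```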