Unlearning with sparse autoencoders
We trained sparse autoencoders on the open-source language model Pythia-2.8b and used them to unlearn Harry Potter-related knowledge. We were able to unlearn a significant amount of this knowledge with little to no side effects. We think this technique is worth exploring further.
I completed this in a two-week research sprint with Evan Anders as part of the MATS program. The write-up is available at this Google Doc link.