Published at: Jun 3, 2024

Unlearning with sparse autoencoders

We trained sparse autoencoders on the open-source language model Pythia-2.8b and used them to unlearn Harry Potter-related knowledge. We were able to unlearn a significant amount of this knowledge with little to no side effects. We think this technique is worth exploring further.
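
To make the idea concrete, here is a minimal sketch of one way a sparse autoencoder (SAE) could be used to edit a model's activations. It is an illustration under stated assumptions, not the method from the write-up: the summary above does not say which intervention was used, so the feature-clamping step, the `TinySAE` class, and the `ablate_features` helper are all hypothetical. A real setup would use a tensor library and apply the edit to Pythia-2.8b's activations; this version uses plain Python lists so it runs on its own.

```python
import random


def matvec(matrix, vec):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


class TinySAE:
    """Hypothetical one-layer sparse autoencoder: ReLU encoder, linear decoder."""

    def __init__(self, d_model, d_hidden, seed=0):
        rng = random.Random(seed)
        # Random weights stand in for a trained dictionary (assumption).
        self.w_enc = [[rng.gauss(0, 0.5) for _ in range(d_model)]
                      for _ in range(d_hidden)]
        self.b_enc = [0.0] * d_hidden
        self.w_dec = [[rng.gauss(0, 0.5) for _ in range(d_hidden)]
                      for _ in range(d_model)]
        self.b_dec = [0.0] * d_model

    def encode(self, x):
        """Map an activation vector to non-negative feature activations."""
        pre = matvec(self.w_enc, x)
        return [max(0.0, p + b) for p, b in zip(pre, self.b_enc)]

    def decode(self, features):
        """Map feature activations back to activation space."""
        out = matvec(self.w_dec, features)
        return [o + b for o, b in zip(out, self.b_dec)]


def ablate_features(sae, x, feature_ids, clamp_value=0.0):
    """Clamp the chosen active features and return the edited activation.

    The SAE's reconstruction error is added back, so an empty edit
    leaves the activation unchanged (up to floating-point rounding).
    """
    features = sae.encode(x)
    error = [xi - ri for xi, ri in zip(x, sae.decode(features))]
    edited = list(features)
    for i in feature_ids:
        if edited[i] > 0.0:  # only intervene where the feature fires
            edited[i] = clamp_value
    return [r + e for r, e in zip(sae.decode(edited), error)]
```

In this sketch, the features to clamp would be ones that fire mostly on Harry Potter text, and the edited vector would replace the original activation during the forward pass. Adding the reconstruction error back is a design choice meant to limit side effects on inputs where none of the chosen features are active.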

I completed this in a two-week research sprint with Evan Anders as part of the MATS program. The write-up is available at this Google Doc link.