AI

Posts related to AI & Machine Learning

  • We trained sparse autoencoders on the open-source language model Pythia-2.8b and used them to unlearn Harry Potter-related knowledge. We were able to unlearn a significant amount of this knowledge with little to no side effects. This technique is worth exploring further.
  • I experimented with alternatives to the standard L1 penalty used to promote sparsity in sparse autoencoders (SAEs). I found that including terms based on an alternative differentiable approximation of feature sparsity in the loss function was an effective way to induce sparsity in SAEs trained on the residual stream of GPT2-small; a toy sketch of such a penalty appears after this list.
  • I trained sparse autoencoders on the key and query vectors of previous-token heads and induction heads in attn-only-2l and gpt2-small, and found interpretable features on which I could intervene in a predictable way.
  • The induction head in a 2-layer attention-only transformer model has a slight bias towards tokens later in the context over earlier ones. Interestingly, its notion of position appears not to depend on the positional embeddings, nor on any specific output of an attention head in the previous layer.
  • In a 2-layer attention-only transformer model, an induction head can combine with an "averaging" head, which stores some kind of average over the previous ~4-5 tokens, to produce a circuit that can predict the next token in repeated sequences of length 2 to 5.
  • This post contains some visualisations and discussion of positional embeddings. The positional embeddings in a 2-layer attention-only transformer model arrange themselves into a helical structure, which presumably allows the model to construct QK matrices that shift attention by a few relative positions using a similar transformation at every position; a toy illustration follows below. The positional embeddings at positions 0 and 1023 have special properties.
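For the positional-embedding post, here is a toy numerical illustration of why a helical arrangement supports relative shifts. It assumes, for simplicity, that a position is encoded as a single (cos, sin) pair on a circle rather than the model's full learned helix; the point is only that one fixed rotation moves every position's embedding forward by the same offset, which is what a QK matrix needs to attend at a fixed relative position.

```python
import numpy as np

# Toy model: encode position t as a point on a circle, (cos(w*t), sin(w*t)).
# A single fixed rotation by w*delta maps every position's embedding to the
# embedding delta positions later -- the same transformation works at all
# positions, so one QK matrix can implement "look delta tokens back/forward".
n_pos, delta = 1024, 3
w = 2 * np.pi / 128                      # arbitrary frequency for the toy helix
t = np.arange(n_pos)
P = np.stack([np.cos(w * t), np.sin(w * t)], axis=1)    # (n_pos, 2) embeddings

theta = w * delta
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])         # rotation by delta positions

shifted = P @ R.T                        # apply the same rotation to every row
print(np.allclose(shifted[:-delta], P[delta:]))         # True: p_t -> p_(t+delta)
```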
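For the sparsity-penalty post, here is a minimal sketch of the kind of loss being compared, assuming a vanilla ReLU SAE. The `SparseAutoencoder`, `sae_loss`, and `eps` names are illustrative, and the tanh-based term is just one possible differentiable proxy for the number of active features; the post's exact formulation may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseAutoencoder(nn.Module):
    """Minimal SAE over d_model-dimensional residual-stream activations."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.enc = nn.Linear(d_model, d_hidden)
        self.dec = nn.Linear(d_hidden, d_model)

    def forward(self, x):
        f = F.relu(self.enc(x))   # feature activations
        return self.dec(f), f     # reconstruction, features

def sae_loss(x, x_hat, f, coeff=1e-3, penalty="l1", eps=0.1):
    """Reconstruction error plus a sparsity penalty on the feature activations.

    "l1" is the standard penalty; "smooth_l0" stands in for a differentiable
    approximation of how many features are active: tanh(|f|/eps) is ~1 for
    clearly active features and ~0 otherwise.
    """
    recon = F.mse_loss(x_hat, x)
    if penalty == "l1":
        sparsity = f.abs().sum(dim=-1).mean()
    else:  # "smooth_l0"
        sparsity = torch.tanh(f.abs() / eps).sum(dim=-1).mean()
    return recon + coeff * sparsity
```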