
Experiments with Sparse Autoencoders on Attention Heads

I trained sparse autoencoders on the key and query vectors of previous token heads and induction heads of attn-only-2l and gpt2-small, and found interpretable features which I could intervene on in a predictable way.

Note: This work was done in a two week research sprint as part of the MATS program

Summary

The main goal of this project was to apply sparse autoencoders to gain insight into the behaviour of several kinds of attention heads. To do this, I trained sparse autoencoders on the key and query vectors of previous token heads and induction heads of attn-only-2l and gpt2-small. The key findings are as follows:

  1. I found interpretable features in all of the attention heads. Between 20% and ~100% of alive features were interpretable, depending on the training parameters.
  2. In previous token heads, the features trained on query and key vectors are strongly dependent on the position within the context. In attn-only-2l, each feature corresponds almost exclusively to one position in the context. In gpt2-small, the positional dependence of each feature is spread more diffusely over 10 – 20 positions.
  3. In induction heads of attn-only-2l and gpt2-small, query and key features are usually of the form “I am [X]” and “I follow [X]” respectively, where [X] can be a single common token or a set of related less common tokens. The attention patterns can be well reproduced using these features. I found that I can causally intervene on the key vectors to delete “I follow ·D” such that induction on the token ·D no longer works.
  4. In some cases, I found that one feature accounted for ~90% of the probability assigned to the next token by the induction head despite an L0 norm of ~100: removing 99 of the ~100 activated features left the probability of the most likely token mostly unchanged, while removing just the single strongest feature decreased it by a factor of 100. This suggests that the L0 norm may not be a good metric in certain situations.
  5. I could not identify a clear difference between the different induction heads in gpt2-small.

Motivation

Attention heads are difficult to understand yet are a critical component in understanding how LLMs work. Sparse autoencoders (SAEs) have been demonstrated to be useful tools for extracting features from MLP layers, both at mlp_post and mlp_out. Applying SAEs to attention heads might be a useful way to clarify what they are doing. If I can recover or clarify the behaviour of attention heads that are already well understood, that would be a useful test of applying SAEs to attention heads. If I were able to use an SAE to show that an attention head does something that was not previously known about that attention head and that is difficult to show using other methods, that would be a good proof-of-concept for applying SAEs to attention heads.

Previous Token Heads

Attn-only-2l (L0H3)

Summary

Metric             Query    Key
L0 norm            5.2      4.3
Recovered loss     97%      92%
# Alive features   250      256
# Features         256      256
L1 coefficient     0.03     0.03

  • Trained an SAE on each of the query and key vectors
  • Almost every query and key feature corresponds very strongly to one position in the context. Some correspond to two adjacent positions, and a small number (~10) are not positionally dependent
  • The dot products of query and key features reproduce the previous-token pattern, i.e. the query feature corresponding to position 40 has a large dot product with the key feature corresponding to position 39 (see figure and discussion below)
  • As an initial test, I set the context length to 128 in the SAE training set and put 128 features in the SAE. Most of the features recovered in both the key and query SAEs correspond to particular positions. Because I didn’t recover a feature for every position, I ran models with 256 features and a context length of 128, which recovered almost all of the positions (>120 out of 128).
  • The attention pattern can be reconstructed based on the query and key vectors, and it works in combination with the induction head in the second layer of the two layer model.

Typical Query Features

Figure
Figure

These are representative examples of max activations for two query features. Note that all the top 20 max activations are at the same position within a given feature. We can also see that there is some relatively weak token dependence of the feature activations.

Typical Key Features

Figure
Figure

These are representative examples of max activations for two key features. Like the query features, all the top 20 max activations are at the same position within a given feature.

Positional Dependence of Features

While the top 20 max activations for each query and key feature are all at the same position in the context, one may wonder about the positions of the rest of the tokens on which a given feature activates (e.g. does the query feature for position 40 also activate strongly at position 39?). To check this, I take the top 2000 max activating tokens (out of 500k, so ~4k per position) for each feature and plot the normalised distribution of their context positions. The figure below shows these position distributions for the query features. Only every 5th feature is plotted for clarity.
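For reference, a minimal sketch of this computation. The tensor names are hypothetical; in my runs the evaluation set was ~500k tokens with a context length of 128.

```python
import torch

def position_histogram(acts, positions, feature_id, top_k=2000, n_ctx=128):
    """Normalised distribution over context positions of the top_k max
    activating tokens for one SAE feature.

    acts:      [n_tokens, n_features] feature activations over the evaluation set
    positions: [n_tokens] position of each token within its context
    """
    top_idx = acts[:, feature_id].topk(top_k).indices   # top-k max activating tokens
    top_pos = positions[top_idx]
    hist = torch.bincount(top_pos, minlength=n_ctx).float()
    return hist / hist.sum()
```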

Figure

Distribution of the context positions of top 2000 max activating tokens per feature for SAE trained on previous token head.

We can see that the majority of features have all of their top 2000 tokens at a single position. Some features have a roughly equal proportion of max activations at two adjacent positions, e.g. the dark green feature #15 at position ~65 in the figure. A small number of features are not dependent on position (see discussion in the next subsection).

If you plot a similar positional dependence across the full data distribution (all 500k test tokens), each feature activates very weakly for a variety of other tokens. This appears to be related to the weak token dependence and is the reason the L0 norms are 5 and 4 for the query and key vectors respectively, rather than ~1 or ~2.

The same results are obtained for the features trained on the key vectors.

Non-Positional Features

While most of the features are position based, there are some features that are token based. The figures below show one example. Key feature #120 fires on tokens that are word fragments usually followed by a vowel, e.g. ·qu, ·scr, ·gr, ·bl, ·sp, ·st, ·cr, ·cl (see max activations below). Of the query features, the one with the largest dot product with key feature #120 is query feature #133, whose max activations are also shown below. Query feature #133 fires on word-ending fragments beginning with a vowel, of the kind one typically finds at the end of words and especially following the fragments that the key feature fires on, e.g. ·ist, ·er, ·ed, ·ian, ·ant. A next step to interpret this would be to train an SAE on the value vectors and see what is written out based on the attention constructed by these two features. But it is clearly connecting together word fragments. Note that while this is interesting to notice, in the case of attn-only-2l I think it is also possible to expand QK matrix pairs between all sets of tokens to achieve a similar result.

Figure
Figure

Max activating examples for non-positional features in the previous token head. Left: Key Feature #120. Right: Query Feature #133.

Reconstructed Attention Pattern

The attention pattern for the previous token head L0H3 can be reconstructed successfully using the key and query features across a range of tokens. Below is a visual example with the prompt “When Mary and John went to the store, John gave a drink to Mary”. The reconstructed pattern gets the positional dependence correct, and also does reasonably well for the weak token dependence. This reconstructed pattern can be connected with the induction head in layer 1 to perform induction successfully.
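A minimal sketch of how such a reconstruction can be done with TransformerLens hooks. It assumes `q_sae` and `k_sae` are the trained SAEs on this head's query and key vectors (hypothetical objects that return `(reconstruction, feature_acts)` when called); the hook names are standard TransformerLens ones.

```python
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("attn-only-2l")
LAYER, HEAD = 0, 3  # previous token head L0H3

# q_sae / k_sae: trained SAEs on this head's query / key vectors (assumed defined elsewhere).
def replace_with_sae(sae):
    def hook(value, hook):
        # value: [batch, pos, n_heads, d_head]; swap in the SAE reconstruction
        # for this head only, leaving the other heads untouched.
        recon, _ = sae(value[:, :, HEAD, :])
        value[:, :, HEAD, :] = recon
        return value
    return hook

store = {}
def save_pattern(pattern, hook):
    # pattern: [batch, n_heads, q_pos, k_pos]
    store["pattern"] = pattern[0, HEAD].detach()

prompt = "When Mary and John went to the store, John gave a drink to Mary"
model.run_with_hooks(
    model.to_tokens(prompt),
    fwd_hooks=[
        (f"blocks.{LAYER}.attn.hook_q", replace_with_sae(q_sae)),
        (f"blocks.{LAYER}.attn.hook_k", replace_with_sae(k_sae)),
        (f"blocks.{LAYER}.attn.hook_pattern", save_pattern),
    ],
)
reconstructed_pattern = store["pattern"]  # compare against the original model's pattern
```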


Figure
Figure

Left: Original attention pattern for the previous token head. Right: Reconstructed attention pattern using the sparse autoencoder. This reconstructed pattern can combine with the induction head in L1 to perform induction.

Dot Product of Features sorted by Position

Since we have sets of key features and query features that each correspond to one position in the context, the dot products between the key and query feature directions should reflect the behaviour of a previous token head. For instance, one might expect that the query feature corresponding to position #20 has the largest dot product with the key feature corresponding to position #19.
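A sketch of the computation behind the figure below. It assumes `q_pos_feature` / `k_pos_feature` map each context position to the id of the feature corresponding to it (built from the positional-dependence analysis above), and that each SAE exposes a `[n_features, d_head]` decoder matrix `W_dec`; all names are hypothetical.

```python
# One feature id per context position (hypothetical lookup tensors).
q_dirs = q_sae.W_dec[q_pos_feature]   # [n_ctx, d_head]
k_dirs = k_sae.W_dec[k_pos_feature]   # [n_ctx, d_head]

# Simulated attention scores between "position features"; for a previous token
# head the largest entries should sit one below the main diagonal, i.e. the
# query feature for position p matches the key feature for position p - 1.
scores = q_dirs @ k_dirs.T            # [n_ctx, n_ctx]
```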

Figure

Dot product between a set of ~128 features from the query and key SAEs ordered by the position they correspond to.

The figure shows the dot product between a set of ~128 features from the query and key SAEs, ordered by the position they correspond to. This simulates the attention scores between each pair of features. We can see that we recover the behaviour of a previous token head (the dark blue line is one below the main diagonal). The solid lines indicate context positions for which there is no single query or key feature corresponding to that position; these positions are represented by multiple features, which allows the reconstructed attention pattern shown earlier to be correct.

Dot Product of Features with Positional Embeddings

We can also check how the query and key features that correspond to each position dot product with the positional embeddings of the model. The two figures below show this calculation for the key and query features, ordered by position. The dark blue line lies along the main diagonal, indicating that the key and query features are picking up on the positional embeddings.

Figure

Dot product between Key Features and Positional Embeddings

Figure

Dot product between Query Features and Positional Embeddings

GPT2-small (L4H11)

Summary

  • I trained an SAE on the query and key vectors of the previous token head (L4H11) in gpt2-small.
  • As I discuss below, both the key and query features are clearly related to position, but appear to be spread over many positions and there is no one-to-one correspondence between features and positions like there is in attn-only-2l. Clear features for positions 0 and 1 always show up, and sometimes for one or two other positions. But no matter which hyperparameters I chose, I could not replicate the figures from the previous token head for attn-only-2l.
  • Despite this, the attention patterns can be reconstructed reasonably well
  • I found it difficult to achieve a low L0 norm without severely degrading the recovered loss. Possible solutions are to increase the number of features, train for longer, or optimise for matching the attention pattern rather than the loss. This is also a problem for other heads in gpt2-small.
  • This is strange, but anecdotally query vectors seem easier to train on than key vectors across a few different heads that I looked at (are they less dense?)
  • Also tried the same tests on L2H2, another previous token head in GPT2-small, and had similar results

Positional Dependence of Features

To look at the positional dependence of the features, I made the same kind of plot as for attn-only-2l. The plot below shows the normalised distribution of positions for five representative query features.

Figure

Distribution of positions of the top 2000 max activating tokens for five representative query features.

We can see that there is clearly a positional dependence, though it is much more spread out in position than the features from the two-layer model. I tried varying the SAE width (from 128 to 16384), the L1 coefficient (from 0.0001 to 0.1), and training for longer (up to 500m activations), but could not find a combination that produced sharper distributions. It is unclear whether sharper positional features should exist at all, but with more time I could further optimise the SAE to check.

Dot Product of Features with positional embeddings

Since the features are much more spread out in position than in attn-only-2l, we might expect the dot products of the features with the positional embeddings to be similarly less sharp. This is indeed the case, as the following figure shows. It’s very spread out, but there is a weak correlation, i.e. blue in top left and bottom right.

Figure

Dot product between the features and the positional embeddings, ordered by the position each feature corresponds to.

Reconstructed Attention Pattern

The figure shows the reconstructed attention pattern for the previous token head based on the key and query vectors for a somewhat randomly chosen prompt of “When Mary and John went to the store, John gave a drink to” repeated 3 times. Despite the less sharp positional dependence of the features compared to the attn-only-2l model, the previous token head behaviour is recovered very well.

Figure

Reconstructed attention pattern for the previous token head based on the key and query vectors

Searching for features that impact attention scores

This is just a simple method I used to find the features that were actually affecting the attention score in a given context (as opposed to the features that activate most strongly on a given token, which are not always the ones that contribute most to the attention score). There is no significant conclusion from this subsection, just something I found useful.

The figure below describes the test I used to find the features that, if removed, would cause the largest change to the attention score between a query and key token. Here I use the decoded activations from the SAE, but delete one feature at a time when calculating the decoded activations. In the figure, the x axis is the feature ID and the y axis is the attention score (i.e. before softmax etc.). The baseline that most feature IDs sit on reflects the fact that most features do not affect the attention score between one given query token and one given key token. The downward spikes in red, e.g. at 96, 120 and 173, correspond to features which, if removed, significantly reduce the attention score between these tokens; likewise for the blue downward spike at 21. Upward spikes imply increased attention scores when the features are removed.
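A minimal sketch of this leave-one-out test for the key-side features, assuming `q` and `k` are the original query and key vectors of the chosen token pair and `k_sae` exposes `encode()` / `decode()` (hypothetical names; the query-side test is symmetric).

```python
import torch

def key_feature_attribution(q, k, k_sae):
    """For each active key feature, the change in the (pre-softmax) attention
    score between this query/key token pair when that feature is removed from
    the decoded key vector."""
    f = k_sae.encode(k)                         # [n_features]
    base = q @ k_sae.decode(f)                  # reconstructed attention score
    deltas = torch.zeros_like(f)
    for i in torch.nonzero(f).squeeze(-1):      # only active features can change the score
        f_ablated = f.clone()
        f_ablated[i] = 0.0
        deltas[i] = q @ k_sae.decode(f_ablated) - base
    return base, deltas                         # large negative deltas = downward spikes
```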

Figure

Features that modify attention scores in the above prompt

Induction Heads

Attn-only-2l (L1H6)

Summary

  • I trained an SAE on the key and query vectors of the induction head in attn-only-2l.
  • Interpretable query features are generally of the form “I am [X] token”, while key features are of the form “I follow [X] token”, where [X] corresponds to either an individual token, e.g. ·(, or a set of related tokens, e.g. ·city, ·island, ·state, ·county
  • Despite the fact that there are far fewer features than tokens in the vocab, the ability to perform induction based on the reconstructed key and query vectors remains very good (see the figure testing induction on random tokens). I attribute this to the fact that most features fire on multiple related tokens, which turns out to give enough resolution for induction to be performed correctly.
  • I can use the features to causally intervene (via the key vectors) in such a way that induction breaks on the token that the feature corresponds to.
  • Something to check: do the key features connect to output features from previous token heads in earlier layers?
Metric             Query    Key
L0 norm            110      92
Recovered loss     99.2%    88%
# Alive features   1050     215
# Features         8192     8192
L1 coefficient     0.001    0.01

Typical Query Features

  • Query features are typically of the form “I am [X] token”. Here are some examples:
Figure
Figure

Examples of query features. Left: a monosemantic feature, e.g. #1196 fires on ·(. Right: a less monosemantic feature, e.g. #5282 fires on ·home and ·space.

Typical Key Features

As described in the summary above, key features are typically of the form “I follow [X] token”.

Figure
Figure

Examples of key features. Left: a monosemantic feature, e.g. #5965 fires for “I follow [,]”. Right: a less monosemantic feature, e.g. #919 fires following ·is, ·are and ·become.

Reconstructed Attention Pattern

I tested the performance of the induction head on the task of induction with the reconstructed attention patterns based on the key and query vectors. The prompt I use is

“Mr and Mrs Dursley were perfectly normal. Mr D”

which tokenises into

[Mr] [ and] [ Mrs] [ D] [urs] [ley] [ were] [ perfectly] [ normal] [.] [ Mr] [ D]

The standard attn-only-2l model uses a previous token head combined with this L1H6 induction head to predict the next token to be [urs] with a probability of 0.89. Based on the reconstructed attention, one obtains a probability of 0.88. The original attention pattern and the pattern reconstructed from the SAEs trained on the key and query vectors are qualitatively similar (not yet quantified).

Figure
Figure

Left: Original attention pattern for the induction head. Probability of “urs” at final token: 0.89. Right: Reconstructed attention pattern using the sparse autoencoder. Probability of “urs” at final token: 0.88.

Another test of how well the model performs induction is to use a prompt of a BOS token followed by a sequence of 50 random tokens and then the same 50-token sequence again. This is a standard setup used in previous work. Performing this test, the reconstructed log probs are slightly lower than the original model’s, but the model is still clearly capable of performing induction on random tokens, even though we don’t have a feature per token.
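A sketch of how this random-token induction test can be set up (a plain forward pass; the SAE-reconstruction hooks from the earlier sketch can be added on top):

```python
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("attn-only-2l")
seq_len = 50

# BOS followed by a 50-token random sequence repeated twice.
rand = torch.randint(0, model.cfg.d_vocab, (1, seq_len))
bos = model.to_tokens("")                              # just the BOS token
tokens = torch.cat([bos, rand, rand], dim=1)           # [1, 1 + 2 * seq_len]

log_probs = model(tokens).log_softmax(dim=-1)
# Log prob of the correct next token at each position.
next_tokens = tokens[0, 1:]
per_pos = log_probs[0, :-1].gather(-1, next_tokens[:, None]).squeeze(-1)
# Positions in the second repeat, where induction should make these large.
induction_log_probs = per_pos[seq_len + 1:]
```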

Figure

Induction behaviour test using a repeated sequence of 50 random tokens

Impact of ablating features on induction performance (L1H6)

Setup                             Reconstructed prob of “urs” at final token
With all features                 0.88
With all features except #5992    0.008
Only #5992                        0.78

Using the following prompt:

“Mr and Mrs Dursley were perfectly normal. Mr D”

where the final token is “ D”, the distribution of key SAE activations on the token following the first “ D” (i.e. “urs”) is shown below:

Figure

Figure shows key vector SAE activations on token “urs”, where #5992 is the strongest activating feature.

Now I test whether ablating Key Feature #5992 (“I follow [·D]”) breaks induction in a random-token setting, rather than only on the “urs” token as in the previous example. The setup is as follows:

  • The prompt consists of BOS + 50 random tokens repeated twice, but with token 30 in both repeated sequences set to [·D]
  • Key Feature #5992 (“I follow [·D]”) is subtracted from the key vectors of the first 50 tokens
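A minimal sketch of this intervention with a TransformerLens hook. It assumes `k_sae` exposes `encode()` and a `[n_features, d_head]` decoder matrix `W_dec`, and that `tokens` is the prompt described above; the names are hypothetical and the feature id comes from my trained SAE.

```python
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("attn-only-2l")
LAYER, HEAD = 1, 6        # induction head L1H6
FEATURE_ID = 5992         # "I follow [·D]" key feature

def subtract_key_feature(value, hook):
    # value: [batch, pos, n_heads, d_head]. Remove the feature's contribution
    # from the induction head's key vectors over BOS + the first 50 tokens.
    k = value[:, :51, HEAD, :]
    acts = k_sae.encode(k)[..., FEATURE_ID]                    # [batch, 51]
    value[:, :51, HEAD, :] = k - acts[..., None] * k_sae.W_dec[FEATURE_ID]
    return value

logits = model.run_with_hooks(
    tokens, fwd_hooks=[(f"blocks.{LAYER}.attn.hook_k", subtract_key_feature)]
)
```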

The main conclusion is that the dip at position 81 shows that when feature #5992 is subtracted, the model is much worse at induction specifically on the [·D] token compared to other random tokens.

Figure

Ablation Test with repeated sequence of 50 random tokens

If you ablate all non-D features at position 31, you get:

Figure

Ablating all non-D features at position 31

But if you scale up (5x) the magnitude of the “I follow [·D]” feature at position 31, you get the following picture, which suggests that you can indeed recover a large probability of predicting ·D as long as the magnitude of the activation is large enough.

Figure

Ablating all non-D features at position 31 but scaling up the magnitude of the “I follow [D]” feature.

GPT2-small (L5H1)

Summary

Metric             Query
L0 norm            91
Recovered loss     97.6%
# Alive features   800
# Features         8192
L1 coefficient     0.01

Query and key features are similar to those of the induction head in the two-layer model. Induction head behaviour in the attention pattern is somewhat recovered, but with a weaker attention score.

Typical Query Features

Figure
Figure

Left: Example of a monosemantic query feature (#417). Right: a somewhat less monosemantic feature.

Typical Key Features

Figure
Figure

Left: Example of a monosemantic key feature (#2535). Right: a somewhat less monosemantic feature, which fires on ·fr and ·st.

Reconstructed Attention Patterns

I use the same prompt, “Mr and Mrs Dursley were perfectly normal. Mr D”, where the final “ D” token should attend to the “urs” that followed the earlier “ D”. Induction behaviour is somewhat recovered based on this attention pattern, but with a weaker attention fraction on the “urs” token (0.13 vs 0.63 in the original).

Figure
Figure

Attention patterns for the original model (left) and the reconstruction using the SAE (right).

The model also performs very well on the 50 random tokens followed by the same 50 random tokens test, but this is affected by the fact that gpt2-small has multiple induction heads. I didn’t have time to investigate it further than this, but this analysis is still limited by the difficulty in training good SAEs with alive interpretable features on gpt2-small.

Name Mover Head in GPT2-small (L9H9)

Summary

These are just brief notes on my experience with L9H9. I did not spend much time on this so do not have firm conclusions.

I trained an SAE on key and query vectors of L9H9, with the idea of seeing whether I could separate out the different functions it performs (name moving in indirect object identification, retrieving a name from e.g. Neel Nanda … Mr -> Nanda, and converting names to twitter handles). I took a slightly deeper look at the prompt “When Mary and John went to the store, John gave a drink to” as a starting point, and tried a variety of L1 coefficients and expansion factors. This is more exploratory work compared to the previous token heads and the induction heads. Due to the problem of many dead features, I ended up looking at a model with 256 learned features (the head has a 64-dimensional space).

  • The attention pattern can be approximately reproduced. For instance for the prompt “When Mary and John went to the store, John gave a drink to”, the left plot is the actual attention pattern from the model and the right plot is the reconstructed attention pattern from the decoded key and query vectors. Note that the final token “ to” attends to “ Mary” as in the indirect object identification task.
  • Key features that fire on the ‘ Mary’ token are very sparse (one main feature)
    • The strongest feature (#32) is clearly a feature that fires on first names.
    • Second strongest also fires on certain names
  • Query features that fire on the last ‘ to’ token are somewhat less sparse (5 - 6 features).
    • Strongest feature (#98) appears to fire when first names have appeared in the context previously.
    • 2nd strongest is hard to interpret, but does fire when names are mentioned in the context similar to #98 but further back in the context.
    • Could not interpret 3rd strongest
    • 4th & 5th strongest look like positional features
    • 6th strongest looks like it thinks it’s a one-syllable word (which it is!). Or I’m reading too much into the max activations
  • If I compute the (unweighted by activations) dot product between all non-zero key-SAE features on the token ‘ Mary’ and all non-zero query-SAE features on the final ‘ to’ token, the strongest combination is between key feature #32 (‘I am a name’) and query feature #98 (‘I followed a name’); see the sketch after this list. This is also true for both ‘ John’ tokens, so I need to investigate further why it attends more strongly to ‘ Mary’ in this context.
  • Somewhat separate to these points, features in the key vector SAE include positional features that look similar to the previous token head features
  • I did not find features related to understanding “Neel Nanda …. Mr” -> Nanda or related to twitter handles.
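A sketch of the dot-product check referenced above. It assumes `k_acts` / `q_acts` are the SAE feature activations on the ‘ Mary’ token and the final ‘ to’ token respectively, and that each SAE exposes a `[n_features, d_head]` decoder matrix `W_dec`; the names are hypothetical.

```python
import torch

k_active = torch.nonzero(k_acts).squeeze(-1)
q_active = torch.nonzero(q_acts).squeeze(-1)

# Dot products between every active query feature direction and every active
# key feature direction (unweighted by the activations themselves).
pair_scores = q_sae.W_dec[q_active] @ k_sae.W_dec[k_active].T   # [n_q_active, n_k_active]

flat = pair_scores.argmax()
best_q = q_active[flat // pair_scores.shape[1]]
best_k = k_active[flat % pair_scores.shape[1]]
# In my runs the strongest pair was query feature #98 ("I followed a name")
# with key feature #32 ("I am a name").
```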

Method used to train sparse autoencoders

  • I trained with a setup matching that of Anthropic’s paper “Towards Monosemanticity”, except applied to different model components (a minimal sketch of this setup is shown after this list).
  • I mainly trained with Neel’s c4-tokenized-2b and also briefly experimented with the Pile and OpenWebText
  • I originally wrote my own method (based on Neel’s code) to train the models. It worked fine initially, but given the two-week time period, I decided to use the open-source one available at https://github.com/ai-safety-foundation/sparse_autoencoder/tree/main to allow me to focus on actual training and experimentation with the attention heads. I made a series of customisations to this code, including making it work on the concatenated attn-z layer and on single attention heads, pre-computing activations separately from training the SAE, writing and reading pre-computed activations, and fixing a few bugs.
  • About halfway through the research sprint, I reached the point where it would be significantly more efficient to write my own code from scratch as there are now several features that I’d like to implement in my own way. These will include:
    • Pre-computed activations. These result in such a significant speed-up (2x - 3x) that I’d consider them a vital component of doing proper hyperparameter sweeps
    • Proper logging of the training process, e.g. density of features, live views of how each feature changes over time by max activations. I think this is useful to understand how the hyperparameters change things.
    • Proper handling of restarts
    • Allow for optimising for different quantities rather than loss based on mean squared error, e.g. attention scores/pattern.
  • The re-sampling helped in some places and didn’t help in others. Anecdotally, it felt like it tended to help more clearly in models trained on the MLP layers or on the concatenated attn-z layer. In the places where it didn’t help, the features died very soon after each re-sampling event. This is something I want to investigate properly, as it seems to be at the core of why it was difficult to train on attention heads of gpt2-small in the first place.
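As referenced above, a minimal sketch of the SAE architecture and loss I used, in the style of “Towards Monosemanticity” (plain PyTorch; resampling, the learning-rate schedule and decoder-norm constraints are omitted):

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_in: int, d_hidden: int):
        super().__init__()
        self.W_enc = nn.Parameter(torch.randn(d_in, d_hidden) * 0.01)
        self.W_dec = nn.Parameter(torch.randn(d_hidden, d_in) * 0.01)
        self.b_enc = nn.Parameter(torch.zeros(d_hidden))
        self.b_dec = nn.Parameter(torch.zeros(d_in))

    def encode(self, x):
        # Pre-encoder bias subtraction, then ReLU encoder.
        return torch.relu((x - self.b_dec) @ self.W_enc + self.b_enc)

    def decode(self, f):
        return f @ self.W_dec + self.b_dec

    def forward(self, x):
        f = self.encode(x)
        return self.decode(f), f

def sae_loss(x, x_hat, f, l1_coeff: float):
    # Reconstruction (MSE) plus L1 sparsity penalty on the feature activations.
    mse = (x_hat - x).pow(2).sum(-1).mean()
    l1 = f.abs().sum(-1).mean()
    return mse + l1_coeff * l1

# Example: an SAE on a 64-dimensional attention head with an expansion factor of 128.
sae = SparseAutoencoder(d_in=64, d_hidden=8192)
```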

General comments regarding training SAEs on attention heads

  • I often got lots (~85%) of dead features with attention heads in GPT2-small compared with smaller 1L and 2L models, in which it was easy to have close to ~100% alive features. It appears to be difficult to avoid this without significantly increasing the L0 norm to ~500. This is the biggest methodological problem I’d try to solve in future work.
  • I found that a high L0 is sometimes not a problem for interpretability because it’s possible that only 1 or 2 features contribute significantly to the function that is taking place. I discussed this explicitly with the induction head in attn-only-2l. This is related to the general question of which metrics are useful in deciding how good the SAE is. But I’d make the point here that if you have 1 – 3 features that activate strongly and 20 that activate weakly, then the head might still be reasonably interpretable and easy to causally intervene on. It just depends on the distribution of the activations of the features on the token you are looking at. I’d also note that it’s definitely possible that the high L0 norm is just a consequence of not enough training activations. Due mainly to time constraints in the two-week window, I worked mainly with 200m - 400m activations. Perhaps training to 2b kills all the weakly activating features; the Anthropic team suggests that longer training improves interpretability.
  • It’s a much better idea to write activations to disk and then read them back during training (a minimal sketch is shown after this list). I tried this for one of the heads and it resulted in a 2x - 3x speedup. I have been running my models on Paperspace and did not easily figure out a way to get storage of several hundred GB (I maxed out at 50 GB), so was unable to fully generalise this. I’d consider this a necessity for properly exploring this topic in the future.
  • I haven’t fully thought this through, and need to test it properly, but it’s possible that you need to be careful with how BOS tokens are considered in the training data for the SAEs on attention heads. A problem that could arise is that the query vectors do not know they should attend to the BOS token unless it’s usually at position 0 in the SAE training set.
  • Key vectors seem anecdotally more difficult to train on than the query vectors. I didn’t have time to properly study this. One example that hints at this intuition is the feature density diagrams for the previous token head in attn-only-2l: the best query model ended up with no dead features, but the best key model ended up with some dead features. Another example showing that there is at least a difference between the two is that the L1 coefficients that ended up working best for the gpt2-small induction head were 0.001 for the query vectors and 0.01 for the key vectors.
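As referenced above, a minimal sketch of the pre-compute-then-train workflow (the output paths are hypothetical; the hook name targets the query vectors of L4H11 as an example):

```python
import os
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
LAYER, HEAD = 4, 11
HOOK = f"blocks.{LAYER}.attn.hook_q"

def save_activation_shards(token_batches, out_dir="acts"):
    # One forward pass over the dataset, saving only this head's vectors to disk.
    os.makedirs(out_dir, exist_ok=True)
    for i, tokens in enumerate(token_batches):
        with torch.no_grad():
            _, cache = model.run_with_cache(tokens, names_filter=HOOK)
        acts = cache[HOOK][:, :, HEAD, :].reshape(-1, model.cfg.d_head)
        torch.save(acts.cpu(), os.path.join(out_dir, f"shard_{i:05d}.pt"))

def stream_activation_shards(out_dir="acts"):
    # SAE training then reads shards instead of re-running the language model.
    for name in sorted(os.listdir(out_dir)):
        yield torch.load(os.path.join(out_dir, name))
```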

Future Directions

  1. Investigate multiple induction heads in gpt2-small further – I ran out of time to properly explore multiple heads. Exploratory experiments involving training SAEs on the query vectors of L5H1 and L5H7 heads show no obvious difference between the query and key features of each head. This investigation could be combined with non-SAE based investigation along the lines of, e.g. does each induction head perform induction better for a subset of tokens?
  2. I think attention head superposition is also a logical next step following on from looking at individual attention heads. I have already trained an SAE on the concatenated z output of all attention heads in the first layer of gelu-2l to search for superposition. I did not find any, but should possibly look at a larger model or later layer to search for such superposition.
  3. Try on some random attention heads in gpt2-small. I tried on a random head in attn-only-2l, but nothing jumped out immediately and I didn’t get enough time to properly go through the results.
  4. It would also be interesting to explore using only part of the data distribution rather than the full data distribution as used in Arthur Conmy’s recent paper. This might facilitate more deliberate function finding of the heads.
  5. I think it would be useful to explore trying to minimise something other than the loss, for instance the recovered attention pattern, attention scores or performance on a standard task e.g. induction or name mover. This might draw out the features that are able to perform a particular task more strongly.
  6. Train the SAEs for longer. I mainly focused on interpretability of features and exploring different types of attention heads in these two weeks, but with more time I could train the SAEs for longer, which might improve the metrics.