AI

Posts related to AI & Machine Learning

  • We trained sparse autoencoders on the open-source language model Pythia-2.8b and used them to unlearn Harry Potter-related knowledge. We were able to unlearn a significant amount of this knowledge with little to no side effects. This technique is worth exploring further.
  • I experimented with alternatives to the standard L1 penalty used to promote sparsity in sparse autoencoders (SAEs). I found that including terms based on an alternative differentiable approximation of feature sparsity in the loss function was an effective way to induce sparsity in SAEs trained on the residual stream of GPT2-small; a toy sketch of such a penalty appears after this list.
  • I trained sparse autoencoders on the key and query vectors of previous-token heads and induction heads in attn-only-2l and gpt2-small, and found interpretable features on which I could intervene in a predictable way.
  • The induction head in a 2-layer attention-only transformer model has a slight bias towards tokens later in the context over earlier ones. Interestingly, its notion of position appears not to depend on the positional embeddings, nor on any specific output of an attention head in the previous layer.
  • In a 2-layer attention-only transformer model, an induction head can combine with an "averaging" head, which stores some kind of average over the previous ~4-5 tokens, to produce a circuit that can predict the next token in repeated sequences of length 2 to 5.
  • This post contains some visualisations and discussion of positional embeddings. The positional embeddings in a 2-layer attention-only transformer model arrange themselves into a helical structure, which presumably allows the model to construct QK matrices that shift attention by a few relative positions using a similar transformation at every position; a toy illustration follows below. The positional embeddings at positions 0 and 1023 have special properties.
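For the positional-embedding post, here is a toy numerical illustration of why a helical arrangement supports relative shifts. It assumes, for simplicity, that a position is encoded as a single (cos, sin) pair on a circle rather than the model's full learned helix; the point is only that one fixed rotation moves every position's embedding forward by the same offset, which is what a QK matrix needs to attend at a fixed relative position.

```python
import numpy as np

# Toy model: encode position t as a point on a circle, (cos(w*t), sin(w*t)).
# A single fixed rotation by w*delta maps every position's embedding to the
# embedding delta positions later -- the same transformation works at all
# positions, so one QK matrix can implement "look delta tokens back/forward".
n_pos, delta = 1024, 3
w = 2 * np.pi / 128                      # arbitrary frequency for the toy helix
t = np.arange(n_pos)
P = np.stack([np.cos(w * t), np.sin(w * t)], axis=1)    # (n_pos, 2) embeddings

theta = w * delta
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])         # rotation by delta positions

shifted = P @ R.T                        # apply the same rotation to every row
print(np.allclose(shifted[:-delta], P[delta:]))         # True: p_t -> p_(t+delta)
```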
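For the sparsity-penalty post, here is a minimal sketch of the kind of loss being compared, assuming a vanilla ReLU SAE. The `SparseAutoencoder`, `sae_loss`, and `eps` names are illustrative, and the tanh-based term is just one possible differentiable proxy for the number of active features; the post's exact formulation may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseAutoencoder(nn.Module):
    """Minimal SAE over d_model-dimensional residual-stream activations."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.enc = nn.Linear(d_model, d_hidden)
        self.dec = nn.Linear(d_hidden, d_model)

    def forward(self, x):
        f = F.relu(self.enc(x))   # feature activations
        return self.dec(f), f     # reconstruction, features

def sae_loss(x, x_hat, f, coeff=1e-3, penalty="l1", eps=0.1):
    """Reconstruction error plus a sparsity penalty on the feature activations.

    "l1" is the standard penalty; "smooth_l0" stands in for a differentiable
    approximation of how many features are active: tanh(|f|/eps) is ~1 for
    clearly active features and ~0 otherwise.
    """
    recon = F.mse_loss(x_hat, x)
    if penalty == "l1":
        sparsity = f.abs().sum(dim=-1).mean()
    else:  # "smooth_l0"
        sparsity = torch.tanh(f.abs() / eps).sum(dim=-1).mean()
    return recon + coeff * sparsity
```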