Towards Monosemanticity: Decomposing Language Models With Dictionary Learning
https://transformer-circuits.pub/2023/monosemantic-features
This paper aims to understand how neural networks work by decomposing them into components that are easier to interpret. Individual neurons turn out to be a poor unit of analysis because they are polysemantic, responding to mixtures of unrelated inputs. One cause of this is superposition, where a network represents more features than it has neurons. The paper explores approaches to recovering the interpretable features hidden by superposition and settles on sparse autoencoders, a form of dictionary learning. By training a sparse autoencoder on a transformer's MLP activations, the researchers extract features that are substantially more interpretable than individual neurons and use them to analyze the model's behavior. The paper supports these findings with detailed case studies, global analyses, and visualizations.
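To make the dictionary-learning idea concrete, here is a minimal, hypothetical sketch of a sparse autoencoder applied to activations: a single ReLU hidden layer whose units act as the learned features, trained with a reconstruction loss plus an L1 penalty that encourages sparsity. The layer sizes, the `l1_coeff` value, and the toy training step are illustrative assumptions, not the paper's exact architecture or hyperparameters.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseAutoencoder(nn.Module):
    """One-hidden-layer autoencoder with a ReLU feature layer.

    d_model:    width of the activations being decomposed (e.g. an MLP layer)
    d_features: size of the overcomplete learned dictionary, typically >> d_model
    """

    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, x: torch.Tensor):
        features = F.relu(self.encoder(x))       # sparse, non-negative feature activations
        reconstruction = self.decoder(features)  # approximate the original activations
        return reconstruction, features


def sae_loss(x, reconstruction, features, l1_coeff=1e-3):
    """Reconstruction error plus an L1 penalty pushing most features to zero."""
    mse = F.mse_loss(reconstruction, x)
    sparsity = features.abs().sum(dim=-1).mean()
    return mse + l1_coeff * sparsity


# Toy usage: decompose a batch of 512-dimensional activations into 4096 features.
if __name__ == "__main__":
    sae = SparseAutoencoder(d_model=512, d_features=4096)
    optimizer = torch.optim.Adam(sae.parameters(), lr=1e-4)

    activations = torch.randn(64, 512)  # stand-in for real MLP activations
    reconstruction, features = sae(activations)
    loss = sae_loss(activations, reconstruction, features)
    loss.backward()
    optimizer.step()
```

Because the feature layer is much wider than the activation space but only a few features fire on any given input, each feature can specialize, which is what makes the resulting directions easier to interpret than raw neurons.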