This has been my favorite video to make so far! I think interpretability is so important, both for ensuring AI is safe and for making our AI models more useful to humans.
I recommend reading these papers:
Toy Models of Superposition: https://transformer-circuits.pub/2022...
Towards Monosemanticity: https://transformer-circuits.pub/2023...
Scaling Monosemanticity: https://transformer-circuits.pub/2024...