Stanford NLP Python Library for Understanding and Improving PyTorch Models via Interventions
Updated May 25, 2024 · Python
Mechanistically interpretable neurosymbolic AI (Nature Comput Sci 2024): losslessly compressing NNs to computer code and discovering new algorithms which generalize out-of-distribution and outperform human-designed algorithms
Interpreting how transformers simulate agents performing RL tasks
🧠 Starter templates for doing interpretability research
Full code for the sparse probing paper.
Sparse and discrete interpretability tool for neural networks
Explain a black-box module in natural language.
Repo accompanying our paper "Do Llamas Work in English? On the Latent Language of Multilingual Transformers".
Steering vectors for transformer language models in PyTorch / Hugging Face
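The steering-vector idea above can be sketched in a few lines: add a fixed direction to a layer's hidden activations at inference time via a forward hook. This is a generic illustration on a toy MLP (the model, vector, and hook here are all hypothetical, not the listed library's API):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy 2-layer MLP standing in for a transformer block.
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))
steering_vector = 0.5 * torch.ones(8)  # illustrative steering direction

def add_steering(module, inputs, output):
    # Shift the hidden activations along the steering direction.
    return output + steering_vector

x = torch.zeros(1, 4)
baseline = model(x)

# Hook on the first layer; returning a tensor replaces that layer's output.
handle = model[0].register_forward_hook(add_steering)
steered = model(x)
handle.remove()
```

With the hook attached, `steered` differs from `baseline` because the downstream layers see the shifted activations; real steering-vector work derives the direction from contrastive prompts rather than a constant.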
CausalGym: Benchmarking causal interpretability methods on linguistic tasks
Universal Neurons in GPT2 Language Models
This repository contains the code used for the experiments in the paper "Fine-Tuning Enhances Existing Mechanisms: A Case Study on Entity Tracking".
Sparse Autoencoder (SAE) research from the OpenMOSS Mechanistic Interpretability Team. Open-sourced and continually updated.
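As background for the SAE entry above: a sparse autoencoder decomposes model activations into an overcomplete set of ReLU features trained with a reconstruction loss plus an L1 sparsity penalty. A minimal sketch under those assumptions (this is a generic formulation, not the OpenMOSS implementation):

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal SAE: overcomplete ReLU features with an L1 penalty."""

    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.enc = nn.Linear(d_model, d_hidden)  # d_hidden >> d_model
        self.dec = nn.Linear(d_hidden, d_model)

    def forward(self, x):
        feats = torch.relu(self.enc(x))  # sparse feature activations
        return self.dec(feats), feats

sae = SparseAutoencoder(d_model=16, d_hidden=64)
x = torch.randn(8, 16)                   # a batch of activations
recon, feats = sae(x)
# Reconstruction loss plus L1 sparsity penalty on the features.
loss = ((recon - x) ** 2).mean() + 1e-3 * feats.abs().sum(-1).mean()
```

Production SAEs add refinements (tied or normalized decoder weights, dead-feature resampling), but the objective is this reconstruction-plus-sparsity trade-off.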
PyTorch and NNsight implementation of AtP* (Kramár et al., 2024, DeepMind)
🦠 DeepDecipher: An open source API to MLP neurons
Competition of Mechanisms: Tracing How Language Models Handle Facts and Counterfactuals
A mechanistic interpretability study investigating a sequential model trained to play the board game Othello
This repository contains the code used for the experiments in the paper "Discovering Variable Binding Circuitry with Desiderata".
Identifying Circuit behind Pronoun Prediction in GPT-2 Small
graphpatch is a library for activation patching on PyTorch neural network models.
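Activation patching, the technique the entry above refers to, runs a model on a corrupted input while splicing in a cached activation from a clean run, to test which components carry the relevant information. A minimal sketch with plain PyTorch hooks on a toy model (hypothetical setup, not graphpatch's API):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Tiny stand-in model for illustration.
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))
clean = torch.randn(1, 4)
corrupt = torch.randn(1, 4)

# 1. Cache the first layer's activation from the clean run.
cache = {}
def save_hook(module, inputs, output):
    cache["hidden"] = output.detach()
h = model[0].register_forward_hook(save_hook)
model(clean)
h.remove()

# 2. Run the corrupt input, patching in the cached clean activation.
def patch_hook(module, inputs, output):
    return cache["hidden"]  # returned tensor replaces the layer's output
h = model[0].register_forward_hook(patch_hook)
patched_out = model(corrupt)
h.remove()

clean_out = model(clean)
```

Because the entire layer is patched here, the corrupt run's output matches the clean run's exactly; in practice one patches narrower components (a head, a position, a single neuron) and measures how much of the clean behavior is restored.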