Stanford NLP Python Library for Understanding and Improving PyTorch Models via Interventions
Updated May 25, 2024 · Python
Mechanistically interpretable neurosymbolic AI (Nature Comput Sci 2024): losslessly compressing NNs to computer code and discovering new algorithms which generalize out-of-distribution and outperform human-designed algorithms
Interpreting how transformers simulate agents performing RL tasks
🧠 Starter templates for doing interpretability research
Full code for the sparse probing paper.
Sparse and discrete interpretability tool for neural networks
Explain a black-box module in natural language.
Repo accompanying our paper "Do Llamas Work in English? On the Latent Language of Multilingual Transformers".
Steering vectors for transformer language models in PyTorch / Hugging Face
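The steering-vector idea above can be sketched in a few lines: add a fixed direction to a layer's hidden activations at inference time via a forward hook. This is a generic illustration on a toy MLP (the model, vector, and hook here are all hypothetical, not the listed library's API):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy 2-layer MLP standing in for a transformer block.
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))
steering_vector = 0.5 * torch.ones(8)  # illustrative steering direction

def add_steering(module, inputs, output):
    # Shift the hidden activations along the steering direction.
    return output + steering_vector

x = torch.zeros(1, 4)
baseline = model(x)

# Hook on the first layer; returning a tensor replaces that layer's output.
handle = model[0].register_forward_hook(add_steering)
steered = model(x)
handle.remove()
```

With the hook attached, `steered` differs from `baseline` because the downstream layers see the shifted activations; real steering-vector work derives the direction from contrastive prompts rather than a constant.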
CausalGym: Benchmarking causal interpretability methods on linguistic tasks
Universal Neurons in GPT2 Language Models
This repository contains the code used for the experiments in the paper "Fine-Tuning Enhances Existing Mechanisms: A Case Study on Entity Tracking".
Sparse Autoencoder (SAE) research from the OpenMOSS Mechanistic Interpretability Team. Open-sourced and continually updated.
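As background for the SAE entry above: a sparse autoencoder decomposes model activations into an overcomplete set of ReLU features trained with a reconstruction loss plus an L1 sparsity penalty. A minimal sketch under those assumptions (this is a generic formulation, not the OpenMOSS implementation):

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal SAE: overcomplete ReLU features with an L1 penalty."""

    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.enc = nn.Linear(d_model, d_hidden)  # d_hidden >> d_model
        self.dec = nn.Linear(d_hidden, d_model)

    def forward(self, x):
        feats = torch.relu(self.enc(x))  # sparse feature activations
        return self.dec(feats), feats

sae = SparseAutoencoder(d_model=16, d_hidden=64)
x = torch.randn(8, 16)                   # a batch of activations
recon, feats = sae(x)
# Reconstruction loss plus L1 sparsity penalty on the features.
loss = ((recon - x) ** 2).mean() + 1e-3 * feats.abs().sum(-1).mean()
```

Production SAEs add refinements (tied or normalized decoder weights, dead-feature resampling), but the objective is this reconstruction-plus-sparsity trade-off.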
PyTorch and NNsight implementation of AtP* (Kramár et al., 2024, DeepMind)
🦠 DeepDecipher: An open source API to MLP neurons
Competition of Mechanisms: Tracing How Language Models Handle Facts and Counterfactuals
A mechanistic interpretability study investigating a sequential model trained to play the board game Othello
This repository contains the code used for the experiments in the paper "Discovering Variable Binding Circuitry with Desiderata".
Identifying Circuit behind Pronoun Prediction in GPT-2 Small
graphpatch is a library for activation patching on PyTorch neural network models.
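Activation patching, the technique the entry above refers to, runs a model on a corrupted input while splicing in a cached activation from a clean run, to test which components carry the relevant information. A minimal sketch with plain PyTorch hooks on a toy model (hypothetical setup, not graphpatch's API):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Tiny stand-in model for illustration.
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))
clean = torch.randn(1, 4)
corrupt = torch.randn(1, 4)

# 1. Cache the first layer's activation from the clean run.
cache = {}
def save_hook(module, inputs, output):
    cache["hidden"] = output.detach()
h = model[0].register_forward_hook(save_hook)
model(clean)
h.remove()

# 2. Run the corrupt input, patching in the cached clean activation.
def patch_hook(module, inputs, output):
    return cache["hidden"]  # returned tensor replaces the layer's output
h = model[0].register_forward_hook(patch_hook)
patched_out = model(corrupt)
h.remove()

clean_out = model(clean)
```

Because the entire layer is patched here, the corrupt run's output matches the clean run's exactly; in practice one patches narrower components (a head, a position, a single neuron) and measures how much of the clean behavior is restored.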