# From Concepts to Components: Concept-Agnostic Attention Module Discovery in Transformers

This is the official repository for our [paper](https://www.arxiv.org/abs/2506.17052). We propose Scalable Attention Module Discovery (SAMD) in transformers, a method that maps arbitrary concepts to specific attention heads. We then propose Scalar Attention Module Intervention (SAMI), a strategy that diminishes or amplifies the effect of a concept by adjusting the discovered attention module (an illustrative sketch of such an intervention appears at the end of this README).

Please 🌟star🌟 this repo and cite our paper 📜 if you like (and/or use) our work, thank you!

## 🚀 Quick Start

- ### ⚙️ Environment Preparation

```bash
conda create -n attention_discovery python=3.10
conda activate attention_discovery
pip install -r requirements.txt
```

- ### 📦 Dataset Preparation

We provide a bash script that collects the positive/negative datasets used in our paper.

```bash
bash src/prepare_dataset.sh
```

- ### 🧪 Representation Preparation

We provide example scripts that collect cached representations for sparse-autoencoder prompts, safety prompts, and reasoning prompts, respectively. If you have downloaded the corresponding Hugging Face models to a local `MODEL_PATH`, you can pass that path as an argument; otherwise, the models are loaded from the Hugging Face Hub.

```bash
cd src
python collect_gemma_sae_rep.py --name dog   # dog, yelling, sf, or french
python collect_safety_rep.py --model llama   # llama, gemma, or qwen
python collect_cot_rep.py --model Llama3.1   # Llama3.1 or gemma
```

- ### 📓 Notebook Launch

After collecting the representations, we are ready to find the attention modules and apply interventions to them. Play with our `summary_notebook` to visualize the attention module and see the outcomes after intervention!

## Acknowledgement

The sparse autoencoder prompt data is obtained from [neuronpedia](https://www.neuronpedia.org/). The safety prompt data is obtained from the [refusal direction](https://github.com/andyrdt/refusal_direction) repository. The Chain-of-Thought data is obtained from the [lm-eval-harness](https://github.com/EleutherAI/lm-evaluation-harness/tree/main) pipeline.

## Citation

If you use our work in your research, please cite:

```
@article{su2025concepts,
  title={From Concepts to Components: Concept-Agnostic Attention Module Discovery in Transformers},
  author={Su, Jingtong and Kempe, Julia and Ullrich, Karen},
  journal={arXiv preprint arXiv:2506.17052},
  year={2025}
}
```

## Legal

Our work is licensed under CC-BY-NC; please refer to the [LICENSE](LICENSE) file in the top-level directory.

Copyright © Meta Platforms, Inc. See the [Terms of Use](https://opensource.fb.com/legal/terms/) and [Privacy Policy](https://opensource.fb.com/legal/privacy/) for this project.
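
## Appendix: Illustrative Attention-Head Intervention Sketch

For readers who want a quick sense of what a head-level intervention looks like before opening the notebook, below is a minimal, unofficial sketch: it scales the output of a hand-picked set of attention heads in a Hugging Face Llama-style model by hooking the attention output projection. The model name, the `(layer, head)` pairs in `module_heads`, and the `scale` value are placeholder assumptions for illustration; the actual module would come from SAMD, and the code in this repository remains the reference implementation.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumption: any Llama-style chat model from the Hugging Face Hub works here.
model_name = "meta-llama/Llama-3.1-8B-Instruct"
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
model.eval()

# Hypothetical attention module: (layer, head) pairs, e.g. as discovered by SAMD.
module_heads = {(10, 3), (10, 7), (21, 12)}
scale = 0.0  # < 1 diminishes the concept's effect, > 1 amplifies it

head_dim = model.config.hidden_size // model.config.num_attention_heads

def make_hook(layer_idx):
    def hook(module, args):
        # args[0] holds the concatenated per-head outputs fed into o_proj,
        # shaped (batch, seq_len, num_heads * head_dim), head-major.
        hidden = args[0].clone()
        for layer, head in module_heads:
            if layer == layer_idx:
                hidden[..., head * head_dim:(head + 1) * head_dim] *= scale
        return (hidden,) + args[1:]
    return hook

# Register a pre-hook on every layer's attention output projection.
handles = [
    layer.self_attn.o_proj.register_forward_pre_hook(make_hook(i))
    for i, layer in enumerate(model.model.layers)
]

prompt = "Tell me a short story about a dog."
inputs = tok(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=64)
print(tok.decode(out[0], skip_special_tokens=True))

# Remove the hooks to restore the unmodified model.
for handle in handles:
    handle.remove()
```

Setting `scale = 0` ablates the selected heads entirely, while values above 1 strengthen their contribution, mirroring the diminish/amplify behavior described above.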