# From Concepts to Components: Concept-Agnostic Attention Module Discovery in Transformers

This is the official repository for our [paper](https://www.arxiv.org/abs/2506.17052). We propose Scalable Attention Module Discovery (SAMD) in transformers, a method that maps arbitrary concepts to specific attention heads. We then propose Scalar Attention Module Intervention (SAMI), a strategy that diminishes or amplifies the effect of a concept by adjusting the discovered attention module (an illustrative sketch of such an intervention appears at the end of this README).

Please 🌟star🌟 this repo and cite our paper 📜 if you like (and/or use) our work, thank you!

## 🚀 Quick Start

- ### ⚙️ Environment Preparation

```bash
conda create -n attention_discovery python=3.10
conda activate attention_discovery
pip install -r requirements.txt
```

- ### 📦 Dataset Preparation

We provide a bash script that collects the positive/negative datasets used in our paper.

```bash
bash src/prepare_dataset.sh
```

- ### 🧪 Representation Preparation

We provide example scripts that collect cached representations for sparse-autoencoder prompts, safety prompts, and reasoning prompts, respectively. If you have downloaded the corresponding Hugging Face models to a local `MODEL_PATH`, you can pass that path as an argument; otherwise, the models are loaded from the Hugging Face Hub.

```bash
cd src
python collect_gemma_sae_rep.py --name dog   # dog, yelling, sf, or french
python collect_safety_rep.py --model llama   # llama, gemma, or qwen
python collect_cot_rep.py --model Llama3.1   # Llama3.1 or gemma
```

- ### 📓 Notebook Launch

After collecting the representations, we are ready to find the attention modules and apply interventions to them. Play with our `summary_notebook` to visualize the attention module and see the outcomes after intervention!

## Acknowledgement

The sparse autoencoder prompt data is obtained from [neuronpedia](https://www.neuronpedia.org/). The safety prompt data is obtained from the [refusal direction](https://github.com/andyrdt/refusal_direction) repository. The Chain-of-Thought data is obtained from the [lm-eval-harness](https://github.com/EleutherAI/lm-evaluation-harness/tree/main) pipeline.

## Citation

If you use our work in your research, please cite:

```
@article{su2025concepts,
  title={From Concepts to Components: Concept-Agnostic Attention Module Discovery in Transformers},
  author={Su, Jingtong and Kempe, Julia and Ullrich, Karen},
  journal={arXiv preprint arXiv:2506.17052},
  year={2025}
}
```

## Legal

Our work is licensed under CC-BY-NC; please refer to the [LICENSE](LICENSE) file in the top-level directory.

Copyright © Meta Platforms, Inc. See the [Terms of Use](https://opensource.fb.com/legal/terms/) and [Privacy Policy](https://opensource.fb.com/legal/privacy/) for this project.
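
## Appendix: Illustrative Attention-Head Intervention Sketch

For readers who want a quick sense of what a head-level intervention looks like before opening the notebook, below is a minimal, unofficial sketch: it scales the output of a hand-picked set of attention heads in a Hugging Face Llama-style model by hooking the attention output projection. The model name, the `(layer, head)` pairs in `module_heads`, and the `scale` value are placeholder assumptions for illustration; the actual module would come from SAMD, and the code in this repository remains the reference implementation.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumption: any Llama-style chat model from the Hugging Face Hub works here.
model_name = "meta-llama/Llama-3.1-8B-Instruct"
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
model.eval()

# Hypothetical attention module: (layer, head) pairs, e.g. as discovered by SAMD.
module_heads = {(10, 3), (10, 7), (21, 12)}
scale = 0.0  # < 1 diminishes the concept's effect, > 1 amplifies it

head_dim = model.config.hidden_size // model.config.num_attention_heads

def make_hook(layer_idx):
    def hook(module, args):
        # args[0] holds the concatenated per-head outputs fed into o_proj,
        # shaped (batch, seq_len, num_heads * head_dim), head-major.
        hidden = args[0].clone()
        for layer, head in module_heads:
            if layer == layer_idx:
                hidden[..., head * head_dim:(head + 1) * head_dim] *= scale
        return (hidden,) + args[1:]
    return hook

# Register a pre-hook on every layer's attention output projection.
handles = [
    layer.self_attn.o_proj.register_forward_pre_hook(make_hook(i))
    for i, layer in enumerate(model.model.layers)
]

prompt = "Tell me a short story about a dog."
inputs = tok(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=64)
print(tok.decode(out[0], skip_special_tokens=True))

# Remove the hooks to restore the unmodified model.
for handle in handles:
    handle.remove()
```

Setting `scale = 0` ablates the selected heads entirely, while values above 1 strengthen their contribution, mirroring the diminish/amplify behavior described above.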