# TAM

**Repository Path**: 910024445/TAM

## Basic Information

- **Project Name**: TAM
- **Description**: No description available
- **Primary Language**: Unknown
- **License**: Not specified
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2025-07-04
- **Last Updated**: 2025-07-04

## Categories & Tags

**Categories**: Uncategorized
**Tags**: None

## README

# Token Activation Map to Visually Explain Multimodal LLMs

We introduce the Token Activation Map (TAM), a groundbreaking method that cuts through the contextual noise in Multimodal LLMs. This technique produces exceptionally clear and reliable visualizations, revealing the precise visual evidence behind every word the model generates.

[![arXiv](https://img.shields.io/badge/arXiv-2506.23270-brown?logo=arxiv&style=flat-square)](https://arxiv.org/abs/2506.23270)

![Overview](imgs/overview.jpg)

(a) The overall framework of TAM. (b) Details of the estimated causal inference module. (c) Details of the rank Gaussian filter module. (d) Fine-grained evaluation metrics.

### Installation

* Python packages:
```
pip install -r requirements.txt
```
* LaTeX for text visualization:
```
sudo apt-get update
sudo apt-get install texlive-xetex
```

### Demo

* A demo for qualitative results:
```
python demo.py
```
Note: The demo supports both image and video inputs; update the inputs accordingly for other scenarios.

### Eval

* Download the formatted datasets for evaluation at [[COCO14+GranDf+OpenPSG](https://hkustconnect-my.sharepoint.com/:u:/g/personal/ylini_connect_ust_hk/EXL-stkCxk5DnwRkNw9MgSABu1vFPv_0FI60yxl0OYxSGQ?e=V3qjHh)] or on [Hugging Face](https://huggingface.co/datasets/yili7eli/TAM/tree/main).
* Evaluation for quantitative results:
```
# python eval.py [model_name] [dataset_path] [vis_path (visualize if given)]
python eval.py Qwen/Qwen2-VL-2B-Instruct data/coco2014
```
Note: Results may vary slightly depending on CUDA, device, and package versions.

### Custom model

* Step 1: load the custom model and its processor (see the sketch below).
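A minimal loading sketch for illustration only, assuming the custom model is `Qwen/Qwen2-VL-2B-Instruct` (the same model used in the eval example); the image path and prompt are placeholders to adapt to your own setup:

```
# Hypothetical Step 1 sketch: load a multimodal model and build inputs with transformers.
# Assumes Qwen/Qwen2-VL-2B-Instruct; swap in your own model, image, and prompt.
import torch
from PIL import Image
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

model_name = "Qwen/Qwen2-VL-2B-Instruct"
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_name, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_name)

# One image and one question; the chat template inserts the image placeholder tokens.
image = Image.open("imgs/overview.jpg")
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "Describe this image."},
    ],
}]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], return_tensors="pt").to(model.device)
```

The resulting `inputs` feed directly into `model.generate(...)` in Step 2.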
* Step 2: get the logits from transformers
```
outputs = model.generate(
    **inputs,
    max_new_tokens=256,
    use_cache=True,
    output_hidden_states=True,  # ---> TAM needs hidden states
    return_dict_in_generate=True
)
logits = [model.lm_head(feats[-1]) for feats in outputs.hidden_states]
```
* Step 3: prepare input args
```
# used to split tokens
# note: 1. The format is [int/list for start, int/list for end].
#       2. The selected tokens are [start + 1 : end].
#       3. If a list is given, the start uses the index of its last token, while the end uses its first.
special_ids = {'img_id': [XXX, XXX], 'prompt_id': [XXX, XXX], 'answer_id': [XXX, XXX]}
# output vision map shape (h, w)
vision_shape = (XXX, XXX)
```
* Step 4: run TAM() to visualize each token
```
# Call TAM() to generate the token activation map for each generation round.
# Arguments:
#   - token ids (inputs and generations)
#   - shape of the vision tokens
#   - logits for each round
#   - special token identifiers for localization
#   - image / video inputs for visualization
#   - processor for decoding
#   - output image path to save the visualization
#   - round index (i in this loop)
#   - raw_map_records: list to collect intermediate visualization data
#   - eval-only flag (False to visualize)
# Returns the TAM vision map for eval; the multimodal TAM is saved inside the function.
raw_map_records = []
for i in range(len(logits)):
    img_map = TAM(
        generated_ids[0].cpu().tolist(),
        vision_shape,
        logits,
        special_ids,
        vis_inputs,
        processor,
        os.path.join(save_dir, str(i) + '.jpg'),
        i,
        raw_map_records,
        False)
```
* Note: see the detailed comments on TAM() in tam.py.

## LICENSE

This project is licensed under the MIT License.

## Citation

```
@misc{li2025tokenactivationmapvisually,
      title={Token Activation Map to Visually Explain Multimodal LLMs},
      author={Yi Li and Hualiang Wang and Xinpeng Ding and Haonan Wang and Xiaomeng Li},
      year={2025},
      eprint={2506.23270},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2506.23270},
}
```