# Harmon: Harmonizing Visual Representations for Unified Multimodal Understanding and Generation

![](data/method.png)

> **[Harmonizing Visual Representations for Unified Multimodal Understanding and Generation](https://arxiv.org/abs/2503.21979)**
>
> Size Wu, Wenwei Zhang, Lumin Xu, Sheng Jin, Zhonghua Wu, Qingyi Tao, Wentao Liu, Wei Li, Chen Change Loy
>
> [![arXiv](https://img.shields.io/badge/arXiv-2503.21979-b31b1b.svg)](https://arxiv.org/abs/2503.21979)
> [![Project Page](https://img.shields.io/badge/Project-Page-green)](https://wusize.github.io/projects/Harmon)
> [![HuggingFace](https://img.shields.io/badge/🤗%20Hugging%20Face-Model-orange)](https://huggingface.co/wusize/Harmon-1_5B)
> [![HuggingFace Space](https://img.shields.io/badge/🤗%20Hugging%20Face-Space-blue)](https://huggingface.co/spaces/wusize/Harmon)
> [![Bibtex](https://img.shields.io/badge/Cite-BibTeX-blue)](https://github.com/wusize/Harmon?tab=readme-ov-file#-citation)

## Introduction

**Harmon** is a novel unified framework for multimodal understanding and generation. Unlike existing state-of-the-art architectures that disentangle visual understanding and generation with separate encoder models, Harmon harmonizes the visual representations of understanding and generation via a shared MAR encoder. It achieves advanced performance on mainstream text-to-image generation benchmarks and exhibits competitive results on multimodal understanding tasks.

This repo provides inference code to run Harmon for image understanding (image-to-text) and text-to-image generation, with two model variants: Harmon-0.5B and Harmon-1.5B.

## 🚀 Project Status

| Task | Status |
|------|--------|
| 🛠️ Inference Code & Model Checkpoints | ✅ Released |
| 🌐 Project Page | ✅ Finished |
| 🤗 Online Demo | ✅ [Finished](https://huggingface.co/spaces/wusize/Harmon) |
| 🔄 Finetuning Code | ✅ Released |

### 🔄 Update

We fine-tuned Harmon-1.5B on the [BLIP3o-60k](https://huggingface.co/datasets/BLIP3o/BLIP3o-60k) dataset. During fine-tuning, only the parameters of the MAR decoder were updated. The fine-tuned model achieves **0.85** on GenEval. The model checkpoint is available at [harmon_1.5b-o.pth](https://huggingface.co/wusize/harmon/blob/main/harmon_1.5b-o.pth).
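In practice, this decoder-only fine-tuning boils down to a standard parameter-freezing pattern. Below is a minimal, hypothetical PyTorch sketch of it; the parameter-name prefix `mar.decoder` is an assumption for illustration (inspect `model.named_parameters()` for the real names), and the actual training setup is documented in [FINETUNE.md](FINETUNE.md).

```python
import torch


def freeze_all_but_mar_decoder(model: torch.nn.Module,
                               decoder_prefix: str = "mar.decoder") -> None:
    """Freeze every parameter except those under the (assumed) MAR decoder prefix."""
    for name, param in model.named_parameters():
        param.requires_grad = name.startswith(decoder_prefix)


# Usage sketch: hand only the trainable parameters to the optimizer.
# model = ...  # a loaded Harmon model
# freeze_all_but_mar_decoder(model)
# optimizer = torch.optim.AdamW(
#     (p for p in model.parameters() if p.requires_grad), lr=1e-5)
```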
## Usage

### 📦 Required Packages

```text
mmengine
transformers==4.45.2
timm==0.9.12
flash_attn==2.3.4
```

### 📥 Checkpoints

Download the model checkpoints from 🤗 [wusize/harmon](https://huggingface.co/wusize/harmon) and organize them as follows:

```text
Harmon/
├── checkpoints
│   ├── kl16.ckpt
│   ├── harmon_0.5b.pth
│   ├── harmon_1.5b.pth
│   ├── harmon_1.5b-o.pth   # Fine-tuned model on BLIP3o-60k
```

It is recommended to download the checkpoints with the following command:

```bash
# pip install -U "huggingface_hub[cli]"
huggingface-cli download wusize/harmon --local-dir checkpoints --repo-type model
```

### 🖌️ Image-to-text Generation

```shell
export PYTHONPATH=./:$PYTHONPATH
python scripts/image2text.py configs/models/qwen2_5_1_5b_kl16_mar_h.py \
    --checkpoint checkpoints/harmon_1.5b.pth --image_size 512 \
    --image data/view.jpg --prompt "Describe the image in detail."
```

### 🖼️ Text-to-image Generation

You can generate an image from a text prompt using the following command:

```shell
export PYTHONPATH=./:$PYTHONPATH
python scripts/text2image.py configs/models/qwen2_5_1_5b_kl16_mar_h.py \
    --checkpoint checkpoints/harmon_1.5b.pth --image_size 512 \
    --prompt 'a dog on the left and a cat on the right.' --output output.jpg
```

To generate a batch of images from prompts stored in a JSON file:

```shell
export PYTHONPATH=./:$PYTHONPATH
accelerate launch scripts/batch_text2image.py configs/models/qwen2_5_1_5b_kl16_mar_h.py \
    --checkpoint checkpoints/harmon_1.5b.pth --image_size 512 \
    --data path/to/xxx.json --output output --batch_size 4 --grid_size 2
```

The JSON file should look like this (a minimal script for writing such a file is sketched in the appendix at the end of this README):

```json
[
    {
        "prompt": "a dog on the left and a cat on the right."
    }
]
```

### 🤗 Loading Models from Huggingface

We have also converted our models to Huggingface format, so you can load Harmon directly with the `transformers` library:

```python
from transformers import AutoTokenizer, AutoModel

harmon_tokenizer = AutoTokenizer.from_pretrained("wusize/Harmon-0_5B",
                                                 trust_remote_code=True)
harmon_model = AutoModel.from_pretrained("wusize/Harmon-0_5B",
                                         trust_remote_code=True).eval().cuda().bfloat16()
```

For more information on the usage of the HF-based models, refer to the model cards listed below:

| Model Variant | LLM | MAR | Hugging Face Hub |
|:-------------:|:---:|:---:|:----------------:|
| **Harmon-0.5B** | Qwen2.5-0.5B-Instruct | MAR-Base | [![Hugging Face](https://img.shields.io/badge/🤗%20Hugging%20Face-Model-orange)](https://huggingface.co/wusize/Harmon-0_5B) |
| **Harmon-1.5B** | Qwen2.5-1.5B-Instruct | MAR-Huge | [![Hugging Face](https://img.shields.io/badge/🤗%20Hugging%20Face-Model-orange)](https://huggingface.co/wusize/Harmon-1_5B) |

### 🔄 Finetuning Harmon

For instructions on how to finetune Harmon models on your custom datasets, please refer to the detailed guide in [FINETUNE.md](FINETUNE.md).

## 📚 Citation

If you find Harmon useful for your research or applications, please cite our paper using the following BibTeX:

```bibtex
@article{wu2025harmon,
    title={Harmonizing Visual Representations for Unified Multimodal Understanding and Generation},
    author={Size Wu and Wenwei Zhang and Lumin Xu and Sheng Jin and Zhonghua Wu and Qingyi Tao and Wentao Liu and Wei Li and Chen Change Loy},
    year={2025},
    eprint={2503.21979},
    archivePrefix={arXiv},
    primaryClass={cs.CV},
    url={https://arxiv.org/abs/2503.21979},
}
```

## 📜 License

This project is licensed under the [NTU S-Lab License 1.0](LICENSE).
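## 🧾 Appendix: Writing the Batch Prompt File

As noted in the text-to-image section above, `scripts/batch_text2image.py` reads a JSON list of objects with a `prompt` key. The sketch below writes such a file using only the standard library; the output file name `prompts.json` and the example prompts are illustrative, not part of the repo.

```python
import json

# Illustrative prompts; replace with your own.
prompts = [
    "a dog on the left and a cat on the right.",
    "a corgi wearing sunglasses on a beach.",
]

# Each entry is an object with a single "prompt" key, matching the
# format shown in the Usage section.
records = [{"prompt": p} for p in prompts]

with open("prompts.json", "w", encoding="utf-8") as f:
    json.dump(records, f, indent=4, ensure_ascii=False)
```

Pass the resulting file to the batch script via `--data prompts.json`.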
## 🙏 Acknowledgement

The project builds upon the following open-source efforts:

- [Qwen2.5](https://github.com/QwenLM/Qwen2.5): We use LLMs from Qwen2.5, including Qwen2.5-0.5B-Instruct and Qwen2.5-1.5B-Instruct.
- [MAR](https://github.com/LTH14/mar): The image generation pipeline is adapted from MAR.