# align-anything
 
 
[![PyPI](https://img.shields.io/pypi/v/align-anything?logo=pypi)](https://pypi.org/project/align-anything) [![License](https://img.shields.io/github/license/PKU-Alignment/align-anything?label=license)](#license)

[📘Documentation](https://pku-alignment.notion.site/Align-Anything-37a300fb5f774bb08e5b21fdeb476c64) | [🆕Update News](#news) | [🛠️Quick Start](#quick-start) | [🚀Algorithms](#algorithms) | [👀Evaluation](#evaluation) | [🤔Reporting Issues](#report-issues)
[Our 100K Instruction-Following Datasets](https://huggingface.co/datasets/PKU-Alignment/Align-Anything-Instruction-100K)
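If you want to pull the dataset locally before training, here is a minimal sketch using the Hugging Face CLI. This is the standard Hub download interface, not an align-anything tool, and it assumes `huggingface_hub` is installed:

```bash
# Fetch the instruction-following dataset from the Hugging Face Hub.
pip install -U huggingface_hub
huggingface-cli download PKU-Alignment/Align-Anything-Instruction-100K --repo-type dataset
```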
Align-Anything aims to align large models of any modality (any-to-any models), including LLMs, VLMs, and others, with human intentions and values. More details about the definition and milestones of alignment for large models can be found in [AI Alignment](https://alignmentsurvey.com). Overall, this framework has the following characteristics:

- **Highly Modular Framework.** Its versatility stems from the abstraction of different algorithm types and well-designed APIs, allowing users to easily modify and customize the code for different tasks.
- **Support for Various Model Fine-Tuning.** The framework includes fine-tuning capabilities for models such as LLaMA3.1, LLaVA, Gemma, Qwen, Baichuan, and others (see the [Model Zoo](https://github.com/PKU-Alignment/align-anything/blob/main/Model-Zoo.md)).
- **Support for Fine-Tuning across Any Modality.** It supports alignment fine-tuning for models of different modalities, including LLMs, VLMs, and others (see the [Development Roadmap](#development-roadmap)).
- **Support for Different Alignment Methods.** The framework supports different alignment algorithms, including SFT, DPO, PPO, and others.
| | Prompt: *Small white toilet sitting in a small corner next to a wall.* | Prompt: *A close up of a neatly made bed with two night stands* | Prompt: *A pizza is sitting on a plate at a restaurant.* | Prompt: *A girl in a dress next to a piece of luggage and flowers.* |
|---|---|---|---|---|
| Before Alignment ([Chameleon-7B](https://huggingface.co/facebook/chameleon-7b)) | *(image)* | *(image)* | *(image)* | *(image)* |
| **After Alignment ([Chameleon 7B Plus](https://huggingface.co/PKU-Alignment/AA-chameleon-7b-plus))** | *(image)* | *(image)* | *(image)* | *(image)* |

> Alignment fine-tuning can significantly enhance the instruction-following capabilities of large multimodal models. After fine-tuning, Chameleon 7B Plus generates images that are more relevant to the prompt.

## Algorithms

We support basic alignment algorithms for different modalities, each of which may involve additional algorithms. For instance, in the text modality, we have also implemented SimPO, KTO, and others.

| Modality | SFT | RM | DPO | PPO |
| ---------------------------------- | --- | --- | --- | --- |
| `Text -> Text (t2t)` | ✔️ | ✔️ | ✔️ | ✔️ |
| `Text+Image -> Text (ti2t)` | ✔️ | ✔️ | ✔️ | ✔️ |
| `Text+Image -> Text+Image (ti2ti)` | ✔️ | ✔️ | ✔️ | ✔️ |
| `Text+Audio -> Text (ta2t)` | ✔️ | ✔️ | ✔️ | ✔️ |
| `Text+Video -> Text (tv2t)` | ✔️ | ✔️ | ✔️ | ✔️ |
| `Text -> Image (t2i)` | ✔️ | ⚒️ | ✔️ | ⚒️ |
| `Text -> Video (t2v)` | ✔️ | ⚒️ | ✔️ | ⚒️ |
| `Text -> Audio (t2a)` | ✔️ | ⚒️ | ✔️ | ⚒️ |
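Each supported cell in the table corresponds to a trainer that is launched as a Python module (see [Quick Start](#quick-start)). Below is a minimal sketch of that launch pattern, assuming module paths follow the `align_anything.trainers.<modality>.<algorithm>` naming used by the `text_image_to_text.dpo` example in the Quick Start; the `text_to_text.sft` path here is an assumption by analogy, so confirm it against the trainers package:

```bash
# Sketch: launch a Text -> Text SFT run by swapping the trainer module.
# The module path below is assumed by analogy with
# align_anything.trainers.text_image_to_text.dpo from the Quick Start example.
MODEL_NAME_OR_PATH="" # model path
TRAIN_DATASETS=""     # dataset path
OUTPUT_DIR=""         # output dir

deepspeed \
  --module align_anything.trainers.text_to_text.sft \
  --model_name_or_path ${MODEL_NAME_OR_PATH} \
  --train_datasets ${TRAIN_DATASETS} \
  --output_dir ${OUTPUT_DIR}
```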
## Evaluation

We support evaluation benchmarks for the following modalities:

| Modality | Supported Benchmarks |
| :-------- | :------------------- |
| `t2t` | [ARC](https://huggingface.co/datasets/allenai/ai2_arc), [BBH](https://huggingface.co/datasets/lukaemon/bbh), [Belebele](https://huggingface.co/datasets/facebook/belebele), [CMMLU](https://huggingface.co/datasets/haonan-li/cmmlu), [GSM8K](https://huggingface.co/datasets/openai/gsm8k), [HumanEval](https://huggingface.co/datasets/openai/openai_humaneval), [MMLU](https://huggingface.co/datasets/cais/mmlu), [MMLU-Pro](https://huggingface.co/datasets/TIGER-Lab/MMLU-Pro), [MT-Bench](https://huggingface.co/datasets/HuggingFaceH4/mt_bench_prompts), [PAWS-X](https://huggingface.co/datasets/google-research-datasets/paws-x), [RACE](https://huggingface.co/datasets/ehovy/race), [TruthfulQA](https://huggingface.co/datasets/truthfulqa/truthful_qa) |
| `ti2t` | [A-OKVQA](https://huggingface.co/datasets/HuggingFaceM4/A-OKVQA), [LLaVA-Bench(COCO)](https://huggingface.co/datasets/lmms-lab/llava-bench-coco), [LLaVA-Bench(wild)](https://huggingface.co/datasets/lmms-lab/llava-bench-in-the-wild), [MathVista](https://huggingface.co/datasets/AI4Math/MathVista), [MM-SafetyBench](https://huggingface.co/datasets/PKU-Alignment/MM-SafetyBench), [MMBench](https://huggingface.co/datasets/lmms-lab/MMBench), [MME](https://huggingface.co/datasets/lmms-lab/MME), [MMMU](https://huggingface.co/datasets/MMMU/MMMU), [MMStar](https://huggingface.co/datasets/Lin-Chen/MMStar), [MMVet](https://huggingface.co/datasets/lmms-lab/MMVet), [POPE](https://huggingface.co/datasets/lmms-lab/POPE), [ScienceQA](https://huggingface.co/datasets/derek-thomas/ScienceQA), [SPA-VL](https://huggingface.co/datasets/sqrti/SPA-VL), [TextVQA](https://huggingface.co/datasets/lmms-lab/textvqa), [VizWizVQA](https://huggingface.co/datasets/lmms-lab/VizWiz-VQA) |
| `tv2t` | [MVBench](https://huggingface.co/datasets/OpenGVLab/MVBench), [Video-MME](https://huggingface.co/datasets/lmms-lab/Video-MME) |
| `ta2t` | [AIR-Bench](https://huggingface.co/datasets/qyang1021/AIR-Bench-Dataset) |
| `t2i` | [ImageReward](https://huggingface.co/datasets/THUDM/ImageRewardDB), [HPSv2](https://huggingface.co/datasets/zhwang/HPDv2), [COCO-30k(FID)](https://huggingface.co/datasets/sayakpaul/coco-30-val-2014) |
| `t2v` | [ChronoMagic-Bench](https://huggingface.co/datasets/BestWishYsh/ChronoMagic-Bench) |
| `t2a` | [AudioCaps(FAD)](https://huggingface.co/datasets/AudioLLMs/audiocaps_test) |

- ⚒️ : coming soon.

# News

- 2024-10-10: We support SFT for the `Any -> Any` modality model Emu3.
- 2024-09-24: We support SFT, DPO, RM, and PPO for `Text + Video -> Text` modality models.
- 2024-09-13: We support SFT, DPO, RM, and PPO for `Text + Audio -> Text` modality models.
- 2024-08-17: We support DPO and PPO for `Text+Image -> Text+Image` modality models.
- 2024-08-15: We add a new function to the evaluation module: the `models_pk` script ([here](./scripts/models_pk.sh)), which enables comparing the performance of two models across different benchmarks.
- 2024-08-06: We restructure the framework to support evaluation of any modality; the list of supported benchmarks is [here](https://github.com/PKU-Alignment/align-anything/tree/main/align_anything/evaluation/benchmarks).
- 2024-08-06: We support the `Text+Image -> Text+Image` modality for the SFT trainer and Chameleon models.
## More News

- 2024-07-23: We support the `Text -> Image`, `Text -> Audio`, and `Text -> Video` modalities for the SFT trainer and DPO trainer.
- 2024-07-22: We support the **Chameleon** model for the SFT trainer and DPO trainer!
- 2024-07-17: We open-source the Align-Anything-Instruction-100K dataset for the text modality. It is available in both [English](https://huggingface.co/datasets/PKU-Alignment/Align-Anything-Instruction-100K) and [Chinese](https://huggingface.co/datasets/PKU-Alignment/Align-Anything-Instruction-100K-zh) versions, each sourced from different datasets and meticulously refined for quality by GPT-4.
- 2024-07-14: We open-source the align-anything framework.
# Installation

```bash
# clone the repository
git clone git@github.com:PKU-Alignment/align-anything.git
cd align-anything

# create a virtual environment
conda create -n align-anything python==3.11
conda activate align-anything
```

- **`[Optional]`** We recommend installing [CUDA](https://anaconda.org/nvidia/cuda) in the conda environment and setting the environment variable:

```bash
# We tested on an H800 computing cluster, and this version of CUDA works well.
# You can adjust this version according to the actual situation of your computing cluster.
conda install nvidia/label/cuda-12.2.0::cuda
export CUDA_HOME=$CONDA_PREFIX
```

> If your CUDA is installed in a different location, such as `/usr/local/cuda/bin/nvcc`, you can set the environment variable as follows:

```bash
export CUDA_HOME="/usr/local/cuda"
```

Finally, install `align-anything` by:

```bash
pip install -e .
```

## Wandb Logger

We support `wandb` logging. By default, it is set to offline. If you need to view wandb logs online, set the `WANDB_API_KEY` environment variable before starting the training:

```bash
export WANDB_API_KEY="..." # your W&B API key here
```

# Quick Start

## Training Scripts

All training scripts are located in `./scripts`, with the parameters that require user input left empty. For example, the DPO script for the `Text + Image -> Text` modality is as follows:

```bash
MODEL_NAME_OR_PATH="" # model path
TRAIN_DATASETS="" # dataset path
TRAIN_TEMPLATE="" # dataset template
TRAIN_SPLIT="" # dataset split
OUTPUT_DIR="" # output dir

source ./setup.sh # source the setup script
export CUDA_HOME=$CONDA_PREFIX # replace it with your CUDA path

deepspeed \
  --master_port ${MASTER_PORT} \
  --module align_anything.trainers.text_image_to_text.dpo \
  --model_name_or_path ${MODEL_NAME_OR_PATH} \
  --train_datasets ${TRAIN_DATASETS} \
  --train_template ${TRAIN_TEMPLATE} \
  --train_split ${TRAIN_SPLIT} \
  --output_dir ${OUTPUT_DIR}
```

We can run DPO with [LLaVA-v1.5-7B](https://huggingface.co/llava-hf/llava-1.5-7b-hf) (HF format) and the [SPA-VL](https://huggingface.co/datasets/sqrti/SPA-VL) dataset using the following script:

```bash
MODEL_NAME_OR_PATH="llava-hf/llava-1.5-7b-hf" # model path
TRAIN_DATASETS="sqrti/SPA-VL" # dataset path
TRAIN_TEMPLATE="SPA_VL" # dataset template
TRAIN_SPLIT="train" # dataset split
OUTPUT_DIR="../output/dpo" # output dir
export WANDB_API_KEY="YOUR_WANDB_KEY" # wandb logging

source ./setup.sh # source the setup script
export CUDA_HOME=$CONDA_PREFIX # replace it with your CUDA path

deepspeed \
  --master_port ${MASTER_PORT} \
  --module align_anything.trainers.text_image_to_text.dpo \
  --model_name_or_path ${MODEL_NAME_OR_PATH} \
  --train_datasets ${TRAIN_DATASETS} \
  --train_template ${TRAIN_TEMPLATE} \
  --train_split ${TRAIN_SPLIT} \
  --output_dir ${OUTPUT_DIR}
```
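The launcher line also accepts DeepSpeed's own options. For instance, here is a sketch of pinning the DPO example above to specific GPUs; `--include` and `--master_port` are standard DeepSpeed launcher flags, not align-anything-specific (in the stock scripts, `MASTER_PORT` is expected to be set by `setup.sh`):

```bash
# Run the DPO example above on GPUs 0 and 1 only, with an explicit rendezvous port.
deepspeed \
  --include localhost:0,1 \
  --master_port 29500 \
  --module align_anything.trainers.text_image_to_text.dpo \
  --model_name_or_path ${MODEL_NAME_OR_PATH} \
  --train_datasets ${TRAIN_DATASETS} \
  --train_template ${TRAIN_TEMPLATE} \
  --train_split ${TRAIN_SPLIT} \
  --output_dir ${OUTPUT_DIR}
```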
## Evaluation

All evaluation scripts can be found in `./scripts`. The `./scripts/evaluate.sh` script runs model evaluation on the benchmarks, with the parameters that require user input left empty. The corresponding script is as follows:

```bash
SCRIPT_DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )"
cd "${SCRIPT_DIR}/../align_anything/evaluation" || exit 1

BENCHMARKS=("") # evaluation benchmarks
OUTPUT_DIR="" # output dir
GENERATION_BACKEND="" # generation backend
MODEL_ID="" # model's unique id
MODEL_NAME_OR_PATH="" # model path
CHAT_TEMPLATE="" # model template

for BENCHMARK in "${BENCHMARKS[@]}"; do
    python __main__.py \
        --benchmark ${BENCHMARK} \
        --output_dir ${OUTPUT_DIR} \
        --generation_backend ${GENERATION_BACKEND} \
        --model_id ${MODEL_ID} \
        --model_name_or_path ${MODEL_NAME_OR_PATH} \
        --chat_template ${CHAT_TEMPLATE}
done
```

For example, you can evaluate [LLaVA-v1.5-7B](https://huggingface.co/llava-hf/llava-1.5-7b-hf) (HF format) on the [POPE](https://huggingface.co/datasets/lmms-lab/POPE) and [MM-SafetyBench](https://huggingface.co/datasets/PKU-Alignment/MM-SafetyBench) benchmarks using the following script:

```bash
SCRIPT_DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )"
cd "${SCRIPT_DIR}/../align_anything/evaluation" || exit 1

BENCHMARKS=("POPE" "MM-SafetyBench") # evaluation benchmarks
OUTPUT_DIR="../output/evaluation" # output dir
GENERATION_BACKEND="vLLM" # generation backend
MODEL_ID="llava-1.5-7b-hf" # model's unique id
MODEL_NAME_OR_PATH="llava-hf/llava-1.5-7b-hf" # model path
CHAT_TEMPLATE="Llava" # model template

for BENCHMARK in "${BENCHMARKS[@]}"; do
    python __main__.py \
        --benchmark ${BENCHMARK} \
        --output_dir ${OUTPUT_DIR} \
        --generation_backend ${GENERATION_BACKEND} \
        --model_id ${MODEL_ID} \
        --model_name_or_path ${MODEL_NAME_OR_PATH} \
        --chat_template ${CHAT_TEMPLATE}
done
```

You can modify the benchmark configuration files in [this directory](https://github.com/PKU-Alignment/align-anything/tree/main/align_anything/configs/evaluation/benchmarks) to suit specific evaluation tasks and models, and adjust the inference parameters for [vLLM](https://github.com/PKU-Alignment/align-anything/tree/main/align_anything/configs/evaluation/vllm) or [DeepSpeed](https://github.com/PKU-Alignment/align-anything/tree/main/align_anything/configs/evaluation/deepspeed) depending on your generation backend. For more details about the evaluation pipeline, refer to the [evaluation README](https://github.com/PKU-Alignment/align-anything/blob/main/align_anything/evaluation/README.md).

# Inference

## Interactive Client

```bash
python3 -m align_anything.serve.cli --model_name_or_path your_model_name_or_path
```

## Interactive Arena

```bash
python3 -m align_anything.serve.arena \
    --red_corner_model_name_or_path your_red_model_name_or_path \
    --blue_corner_model_name_or_path your_blue_model_name_or_path
```

## Report Issues

If you have any questions while using align-anything, don't hesitate to ask on [the GitHub issue page](https://github.com/PKU-Alignment/align-anything/issues/new/choose); we will reply within 2-3 working days.

# Citation

Please cite this repo if you use its data or code.

```bibtex
@misc{align_anything,
  author = {PKU-Alignment Team},
  title = {Align Anything: training all modality models to follow instructions with unified language feedback},
  year = {2024},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/PKU-Alignment/align-anything}},
}
```

# License

align-anything is released under the Apache License 2.0.