# Lyra
**Repository Path**: homer-1943/Lyra
## Basic Information
- **Project Name**: Lyra
- **Description**: No description available
- **Primary Language**: Unknown
- **License**: Apache-2.0
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No
## Statistics
- **Stars**: 0
- **Forks**: 0
- **Created**: 2024-12-18
- **Last Updated**: 2024-12-18
## Categories & Tags
**Categories**: Uncategorized
**Tags**: None
## README
# Lyra: An Efficient and Speech-Centric Framework for Omni-Cognition

Overview of Lyra:
Lyra shows superiority over leading omni-models in:
1. Stronger performance: achieves SOTA results across a variety of speech-centric tasks.
2. More versatile: supports image, video, speech/long-speech, and sound understanding, as well as speech generation.
3. More efficient: requires less training data and supports faster training and inference.
## Release
- [12/12] 🔥 Lyra is coming! We release the [paper](https://arxiv.org/pdf/2412.09501.pdf), [demo](https://103.170.5.190:17860/), [code](https://github.com/dvlab-research/Lyra), [models](https://huggingface.co/collections/zszhong/lyra-model-674ea5bb3b39ff8f15de75fc), [training data](https://huggingface.co/collections/zszhong/lyra-data-675d80fbab80334eb52cdd82) and [evaluation data](https://huggingface.co/collections/zszhong/lyra-evaluation-675d7f038747ba865932a149). More related checkpoints will be released soon!
## Contents
- [Demo](#demo)
- [Install](#install)
- [Model](#model)
- [Preparation](#preparation)
- [Train](#train)
- [Evaluation](#evaluation)
- [Examples](#examples)
- [Citation](#citation)
- [Acknowledgement](#acknowledgement)
- [License](#license)
## Demo
We provide a [video demo](https://www.youtube.com/watch?v=7kh-M0jmmtI) for a better experience and illustration. More examples can be found on our [project page](https://lyra-omni.github.io/), and feel free to try our [online demo](https://103.170.5.190:17860/)! Due to compute cost, the demo machine's GPU memory (GeForce RTX 3090), and upload storage limits, the long-speech function is not supported in the current online demo.
## Install
Please follow the instructions below to install the required packages.
1. Clone this repository:
```bash
git clone https://github.com/dvlab-research/Lyra.git
```
2. Install the package:
```bash
conda create -n lyra python=3.10 -y
conda activate lyra
cd Lyra
pip install --upgrade pip
pip install -e .
```
3. Install optional packages for simultaneous text-speech generation:
```bash
pip install pip==24.0
pip install fairseq==0.12.2
pip install --upgrade pip
```
## Model
Lyra supports multi-modal inputs. When the data contains a speech modality, we use the **latent cross-modality regularizer** to assist. Data from each modality is processed through encoders and projectors before being sent into the LLM. Within the LLM, **multi-modality LoRA** and **latent multi-modality extraction** modules operate synergistically, facilitating the **simultaneous generation** of both speech and text outputs.
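The encoder-projector-LLM flow can be pictured with a minimal, hypothetical PyTorch sketch; the class name, dimensions, and token counts below are illustrative, not Lyra's actual modules (the LoRA and latent extraction modules are omitted):

```python
import torch
import torch.nn as nn

class ModalityProjector(nn.Module):
    """Maps encoder features into the LLM embedding space (illustrative)."""
    def __init__(self, in_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(in_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        return self.proj(feats)

# Placeholder encoder outputs (batch, tokens, dim); real features would
# come from the Whisper and ViT encoders listed in the table below.
speech_feats = torch.randn(1, 120, 1280)
image_feats = torch.randn(1, 256, 1536)
text_embeds = torch.randn(1, 32, 3584)  # LLM token embeddings

speech_proj = ModalityProjector(1280, 3584)
image_proj = ModalityProjector(1536, 3584)

# Projected modality tokens are concatenated with text embeddings
# and fed into the LLM as one sequence.
llm_inputs = torch.cat(
    [image_proj(image_feats), speech_proj(speech_feats), text_embeds], dim=1
)
print(llm_inputs.shape)  # torch.Size([1, 408, 3584])
```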
We provide all our fully finetuned models:
| Model | Base LLM | Vision Encoder | Speech Encoder | Projector | Full |
| ------------ | ------------------ | ------------------ | ------------------------------------------------------------ | ----------- | ------------------------------------------------------ |
| Lyra_Mini_3B | [Qwen2VL_2B_LLM]() | [Qwen2VL_2B_ViT]() | [whisper-large-v3-turbo](https://huggingface.co/openai/whisper-large-v3-turbo) | [3B_proj]() | [3B_ckpt](https://huggingface.co/zszhong/Lyra_Mini_3B) |
| Lyra_Base_9B | [Qwen2VL_7B_LLM]() | [Qwen2VL_7B_ViT]() | [whisper-large-v3](https://huggingface.co/openai/whisper-large-v3) | [9B_proj]() | [9B_ckpt](https://huggingface.co/zszhong/Lyra_Base_9B) |
| Lyra_Pro_74B | Qwen2VL_70B_LLM | Qwen2VL_70B_ViT | whisper-large-v3 | 74B_proj | 74B_ckpt |
## Preparation
### Training Data
We provide the processed data for model training. All speech-related training data can be downloaded from [Lyra-Data](https://huggingface.co/collections/zszhong/lyra-data-675d80fbab80334eb52cdd82).
For **model pretraining data**, please download the following multi-modality training data and organize it as shown below.
`→` means put the data in the indicated local folder. The pretraining json file can be downloaded from [Lyra_Pretrain](https://huggingface.co/datasets/zszhong/Lyra-Data/tree/main/Lyra_Pretrain).
- [LibriSpeech](https://www.openslr.org/12) → `data/Lyra_Pretrain/LibriSpeech`, → `data/Lyra_SFT/multi_modality_speech/LibriSpeech`, and → `data/Lyra_Eval/LibriSpeech` (download all training and development data).
- [Common Voice](https://commonvoice.mozilla.org/en/datasets) → `data/Lyra_Pretrain/CommonVoice` (download the English Common Voice Corpus).
During pretraining, we filtered out noisy and overly short audio clips; a sketch of this kind of filter follows.
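A minimal sketch of such a duration filter, using the third-party `soundfile` package; the 1-second threshold is purely illustrative, as the exact filtering criteria are not published:

```python
import glob
import soundfile as sf

MIN_DURATION_S = 1.0  # hypothetical threshold, not the authors' actual value

kept = []
for path in glob.glob("data/Lyra_Pretrain/LibriSpeech/**/*.flac", recursive=True):
    info = sf.info(path)                      # reads header only, not samples
    duration = info.frames / info.samplerate  # clip length in seconds
    if duration >= MIN_DURATION_S:
        kept.append(path)
print(f"kept {len(kept)} clips")
```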
For the **image part of the finetuning data**, similar to Mini-Gemini, please download the following instruction data and organize it as shown below.
`→` means put the data in the indicated local folder.
- [COCO train2017](http://images.cocodataset.org/zips/train2017.zip) → `data/Lyra_SFT/multi_modality_image/coco`
- [GQA](https://downloads.cs.stanford.edu/nlp/data/gqa/images.zip) → `data/Lyra_SFT/multi_modality_image/gqa`
- [OCR-VQA](https://drive.google.com/drive/folders/1_GYPY5UkUy7HIcR0zq3ZCFgeZN7BAfm_?usp=sharing) (**we save all files as `.jpg`**) → `data/Lyra_SFT/multi_modality_image/ocr_vqa`
- [TextVQA](https://dl.fbaipublicfiles.com/textvqa/images/train_val_images.zip) (not included for training) → `data/Lyra_SFT/multi_modality_image/textvqa`
- [VisualGenome part1](https://cs.stanford.edu/people/rak248/VG_100K_2/images.zip), [VisualGenome part2](https://cs.stanford.edu/people/rak248/VG_100K_2/images2.zip) → `data/Lyra_SFT/multi_modality_image/vg`
- [ShareGPT4V-100K](https://github.com/InternLM/InternLM-XComposer/blob/main/projects/ShareGPT4V/docs/Data.md) → `data/Lyra_SFT/multi_modality_image/sam`, `share_textvqa`, `wikiart`, ...
- [LAION GPT4V](https://huggingface.co/datasets/laion/gpt4v-dataset) → `data/Lyra_SFT/multi_modality_image/gpt4v-dataset`
- [ALLaVA Instruction](https://github.com/FreedomIntelligence/ALLaVA) → `data/Lyra_SFT/multi_modality_image/ALLaVA-4V`
- [DocVQA](https://www.docvqa.org/datasets/docvqa) → `data/Lyra_SFT/multi_modality_image/docvqa`
- [ChartQA](https://github.com/vis-nlp/ChartQA) → `data/Lyra_SFT/multi_modality_image/chartqa`
- [DVQA](https://github.com/kushalkafle/DVQA_dataset) → `data/Lyra_SFT/multi_modality_image/dvqa`
- [AI2D](https://allenai.org/data/diagrams) → `data/Lyra_SFT/multi_modality_image/ai2d`
For the **audio part of the finetuning data**, please download the following instruction data and organize it as shown below.
`→` means put the data in the indicated local folder.
- [Lyra_MultiModal](https://huggingface.co/datasets/zszhong/Lyra-Data/tree/main/Lyra_SFT/multi_modality_speech) → `data/Lyra_SFT/multi_modality_speech/Lyra_MM`
For reproduction details, please refer to the [Lyra multi-modality preparation](https://github.com/dvlab-research/Lyra/tree/main/data_preparation/multi_modality).
For the **long-speech** audio finetuning data, please download the following instruction data and organize it as shown below.
`→` means put the data in the indicated local folder.
- [Lyra_LongSpeech](https://huggingface.co/datasets/zszhong/Lyra-Data/tree/main/Lyra_SFT/long_speech) → `data/Lyra_SFT/long_speech/Lyra_LongSpeech`
For reproduction details, please refer to the [Lyra long-speech preparation](https://github.com/dvlab-research/Lyra/tree/main/data_preparation/long_speech).
For the **text-speech generation** data, please download the following instruction data and organize it as shown below.
`→` means put the data in the indicated local folder.
- [Lyra_SpeechGeneration](https://huggingface.co/datasets/zszhong/Lyra-Data/tree/main/Lyra_SFT/speech_generation) → `data/Lyra_SFT/speech_generation`
For reproduction details, please refer to the [Lyra speech generation preparation](https://github.com/dvlab-research/Lyra/tree/main/data_preparation/speech_generation).
### Evaluation Data
All speech-related evaluation data can be downloaded from [Lyra-Evaluation](https://huggingface.co/collections/zszhong/lyra-evaluation-675d7f038747ba865932a149).
For **speech-centric evaluation data**, we mainly consider three types:
1. **text-speech ability**: LibriSpeech, Lyra_needle_in_a_haystack
- [Lyra_needle_in_a_haystack](https://huggingface.co/datasets/zszhong/Lyra-Eval/tree/main/Lyra_needle_in_a_haystack) → `data/Lyra_Eval/Lyra_needle_in_a_haystack`
2. **image-speech ability**: TextVQA_speech, MM_vet_speech, Docvqa_val, Chartvqa_human
- [TextVQA_speech](https://huggingface.co/datasets/zszhong/Lyra-Eval/tree/main/TextVQA_speech) → `data/Lyra_Eval/TextVQA_speech`
- [MM_vet_speech](https://huggingface.co/datasets/zszhong/Lyra-Eval/tree/main/MM_vet_speech) → `data/Lyra_Eval/MM_vet_speech`
- [Docvqa_val](https://huggingface.co/datasets/zszhong/Lyra-Eval/tree/main/Docvqa_val) → `data/Lyra_Eval/Docvqa_val`
- [Chartvqa_human](https://huggingface.co/datasets/zszhong/Lyra-Eval/tree/main/Chartvqa_human) → `data/Lyra_Eval/Chartvqa_human`
3. **video-speech ability**: VideoMME_speech
- [VideoMME_speech](https://huggingface.co/datasets/zszhong/Lyra-Eval/tree/main/VideoMME_speech) → `data/Lyra_Eval/VideoMME_speech`
Please put the pretraining data, finetuning data, and evaluation data in the `Lyra_Pretrain`, `Lyra_SFT`, and `Lyra_Eval` subfolders, following [Structure](#structure).
### Pretrained Weights
We recommend downloading the pretrained weights from the links below.
`Qwen2VL_XB_LLM` and `Qwen2VL_XB_ViT` are extracted from [Qwen2-VL](https://github.com/QwenLM/Qwen2-VL) to adapt to our training framework; for your convenience, the corresponding download links are also provided in the [Model](#model) section.
Download [whisper-large-v3-turbo](https://huggingface.co/openai/whisper-large-v3-turbo), [whisper-large-v3](https://huggingface.co/openai/whisper-large-v3), and [imagebind_huge](https://dl.fbaipublicfiles.com/imagebind/imagebind_huge.pth), and put them in `model_zoo` following [Structure](#structure).
Download the unit-based HiFi-GAN vocoder with the following commands:
```shell
wget https://dl.fbaipublicfiles.com/fairseq/speech_to_speech/vocoder/code_hifigan/mhubert_vp_en_es_fr_it3_400k_layer11_km1000_lj/g_00500000 -P model_zoo/audio/vocoder/
wget https://dl.fbaipublicfiles.com/fairseq/speech_to_speech/vocoder/code_hifigan/mhubert_vp_en_es_fr_it3_400k_layer11_km1000_lj/config.json -P model_zoo/audio/vocoder/
```
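Once downloaded, the vocoder can be loaded with fairseq's `CodeHiFiGANVocoder`, the class used in fairseq's speech-to-speech examples. A hedged sketch follows; the unit sequence is a placeholder, and Lyra's own inference code may wrap this differently:

```python
import json

import torch
from fairseq.models.text_to_speech.vocoder import CodeHiFiGANVocoder

# Load the checkpoint and config downloaded above.
with open("model_zoo/audio/vocoder/config.json") as f:
    vocoder_cfg = json.load(f)
vocoder = CodeHiFiGANVocoder("model_zoo/audio/vocoder/g_00500000", vocoder_cfg)

# Placeholder discrete speech units (real ones would come from the
# model's speech-generation head); output is a mono waveform tensor.
units = torch.LongTensor([[10, 42, 42, 7, 99]])
wav = vocoder({"code": units})
print(wav.shape)
```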
### Structure
The folder structure should be organized as follows before training.
```
Lyra
├── lyra
├── scripts
├── work_dirs
│   ├── Lyra
│   │   ├── Lyra_Mini_3B
│   │   ├── Lyra_Base_9B
│   │   ├── Lyra_Pro_74B
│   │   ├── ...
├── model_zoo
│   ├── LLM
│   │   ├── Qwen2VL_2B_LLM
│   │   ├── Qwen2VL_7B_LLM
│   │   ├── Qwen2VL_70B_LLM
│   │   ├── Qwen2.5
│   │   ├── LLaMA3.2
│   │   ├── ...
│   ├── vision
│   │   ├── Qwen2VL_2B_ViT
│   │   ├── Qwen2VL_7B_ViT
│   │   ├── Qwen2VL_70B_ViT
│   │   ├── clip-vit-large
│   │   ├── siglip
│   │   ├── ConvNeXt
│   │   ├── ...
│   ├── audio
│   │   ├── whisper-large-v3-turbo
│   │   ├── whisper-large-v3
│   │   ├── imagebind_huge
│   │   ├── vocoder
│   │   ├── ...
├── data
│   ├── Lyra_Pretrain
│   │   ├── lyra_pretrain.json
│   │   ├── LibriSpeech
│   │   ├── CommonVoice
│   ├── Lyra_SFT
│   │   ├── multi_modality_speech
│   │   │   ├── lyra_multimodal.json
│   │   │   ├── Lyra_MM
│   │   │   ├── LibriSpeech
│   │   ├── multi_modality_image (similar to MGM-Finetune)
│   │   │   ├── llava
│   │   │   ├── coco
│   │   │   ├── gqa
│   │   │   ├── ocr_vqa
│   │   │   ├── textvqa
│   │   │   ├── vg
│   │   │   ├── gpt4v-dataset
│   │   │   ├── ...
│   │   ├── long_speech
│   │   │   ├── lyra_longspeech.json
│   │   │   ├── Lyra_LongSpeech
│   │   ├── speech_generation
│   │   │   ├── lyra_speechgeneration.json
│   ├── Lyra_Eval
│   │   ├── LibriSpeech
│   │   ├── TextVQA_speech
│   │   ├── MM_vet_speech
│   │   ├── Docvqa_val
│   │   ├── Chartvqa_human
│   │   ├── VideoMME_speech
│   │   ├── Lyra_needle_in_a_haystack
```
## Train
The training process consists of four stages: (1) feature alignment stage: bridge the speech and language tokens; (2) multi-modality instruction tuning stage: teach the model to follow text-image-speech multimodal instructions; (3) long-speech instruction tuning stage: enable the model to handle long speech audio; (4) text-speech streaming generation stage: enable the model to stream text and speech simultaneously.
Our models are trained on 8 A100 GPUs with 80 GB of memory. To train on fewer GPUs, reduce `per_device_train_batch_size` and increase `gradient_accumulation_steps` accordingly, always keeping the global batch size the same: `per_device_train_batch_size` × `gradient_accumulation_steps` × `num_gpus`. A worked example is sketched below.
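A minimal sketch of the arithmetic; the batch sizes are illustrative, not the repo's actual defaults:

```python
# Keep per_device_train_batch_size * gradient_accumulation_steps * num_gpus
# constant when changing the GPU count (values below are hypothetical).
configs = [
    {"num_gpus": 8, "per_device_train_batch_size": 4, "gradient_accumulation_steps": 4},
    {"num_gpus": 4, "per_device_train_batch_size": 4, "gradient_accumulation_steps": 8},
]
for c in configs:
    global_bs = (c["num_gpus"]
                 * c["per_device_train_batch_size"]
                 * c["gradient_accumulation_steps"])
    print(c["num_gpus"], "GPUs -> global batch size", global_bs)  # 128 both times
```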
Please make sure you download and organize the data following [Preparation](#preparation) before training.
NOTE: Please set `hostfile/hostfile_2` for 2-machine training and `hostfile/hostfile_4` for 4-machine training.
(1) feature alignment stage:
```bash
bash scripts/train/Lyra_Base_9B/Lyra_Base_qwen2vl_9B_Pretrain.sh
```
(2) multi-modality instruction tuning stage:
```bash
bash scripts/train/Lyra_Base_9B/Lyra_Base_qwen2vl_9B_SFT_text_image_speech.sh
```
(3) long-speech instruction tuning stage:
```bash
bash scripts/train/Lyra_Base_9B/Lyra_Base_qwen2vl_9B_SFT_long_speech.sh
```
(4) text-speech streaming generation stage:
```bash
bash scripts/train/Lyra_Base_9B/Lyra_Base_qwen2vl_9B_SFT_speech_generate.sh
```
## Evaluation
### Benchmark Results
Benchmarks are grouped by capability: Text-Image (TextVQA, MME, MM-Vet), Text-Video (VideoMME, MVBench, Egoschema), Image-Speech (TextVQA<sup>s</sup>, DocVQA<sup>s</sup>, ChartQA<sup>s</sup>), and Text-Speech (LibriSpeech WER, lower is better).

| Omni Comparison | Params. | TextVQA | MME | MM-Vet | VideoMME | MVBench | Egoschema | TextVQA<sup>s</sup> | DocVQA<sup>s</sup> | ChartQA<sup>s</sup> | LibriSpeech |
| --------------- | ------- | ------- | ---- | ------ | -------- | ------- | --------- | ------------------- | ------------------ | ------------------- | ----------- |
| Mini-Gemini | 8B | 71.9 | 1989 | 53.5 | - | - | - | - | - | - | - |
| LLaVA-OV | 7B | 65.4 | 1998 | 57.5 | 58.2 | 56.7 | 60.1 | - | - | - | - |
| Intern-VL2 | 8B | 77.4 | 2211 | 60.0 | 54.0 | - | - | - | - | - | - |
| Mini-Omni | 7B | - | - | - | - | - | - | - | - | - | 4.5 |
| SALMONN | 13B | - | - | - | - | - | - | - | - | - | 2.1 |
| Qwen2-Audio | 8B | - | - | - | - | - | - | - | - | - | 1.6 |
| Intern-Omni | 8B | 80.6 | 2210 | 60.0 | - | - | - | 69.1 | 79.9 | 56.0 | - |
| VITA | 66B | - | 2097 | 41.6 | 59.2 | - | - | - | - | - | 8.1 |
| EMOVA | 14B | 82.0 | 2205 | 55.8 | - | - | - | - | - | - | 4.0 |
| Lyra-Mini | 3B | 78.3 | 1884 | 51.2 | 55.0 | 62.5 | 54.1 | 73.9 | 75.0 | 40.7 | 2.4 |
| Lyra-Base | 9B | 82.6 | 2335 | 63.5 | 62.8 | 67.2 | 63.2 | 80.0 | 85.5 | 61.0 | 2.0 |
| Lyra-Pro | 74B | 83.5 | 2485 | 71.4 | 69.9 | 72.3 | 75.8 | 81.0 | 89.4 | 68.5 | 1.8 |
### Benchmark Scripts
Please make sure you download and organize the [evaluation data](https://huggingface.co/collections/zszhong/lyra-evaluation-675d7f038747ba865932a149) following [Preparation](#preparation) before starting evaluation.
We provide four **speech-centric** evaluation benchmark scripts here:
**Text-speech ability**: LibriSpeech:
```bash
# you can change the model path and lora path in the script:
# e.g., CKPT="Lyra_Mini_3B", LORA_PATH="Lyra_Mini_3B/speech_lora"
# e.g., CKPT="Lyra_Base_9B", LORA_PATH="Lyra_Base_9B/speech_lora"
# the LibriSpeech test-clean WER result of Lyra-Mini-3B is about 2.4%
# the LibriSpeech test-clean WER result of Lyra-Base-9B is about 2.0%
bash scripts/eval/lyra_librispeech_wer.sh
```
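For a sanity check of reported numbers, below is a minimal sketch of how word error rate is computed in general, using the third-party `jiwer` package; this is not the repo's evaluation script, and the strings are placeholders:

```python
# WER = (substitutions + deletions + insertions) / reference word count.
from jiwer import wer

references = ["he hoped there would be stew for dinner"]
hypotheses = ["he hoped there would be a stew for dinner"]  # one insertion
print(f"WER: {wer(references, hypotheses) * 100:.2f}%")  # 12.50% (1 error / 8 words)
```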
**Image-speech ability**: TextVQA_speech:
```bash
# the TextVQA (speech) accuracy result of Lyra-Mini-3B is about 73.6%
# the TextVQA (speech) accuracy result of Lyra-Base-9B is about 80.5%
bash scripts/eval/lyra_textvqa_speech.sh
```
**Image-speech ability**: Chartvqa_human:
```bash
# the ChartQA (speech) accuracy result of Lyra-Mini-3B is about 42.2%
# the ChartQA (speech) accuracy result of Lyra-Base-9B is about 61.0%
bash scripts/eval/lyra_chartvqa_speech.sh
```
**Image-speech ability**: Docvqa_val:
```bash
# the DocVQA (speech) accuracy result of Lyra-Mini-3B is about 76.0%
# the DocVQA (speech) accuracy result of Lyra-Base-9B is about 86.2%
bash scripts/eval/lyra_docvqa_speech.sh
```
### CLI Inference
Chat with images without the need for a Gradio interface; multiple GPUs and 4-bit/8-bit quantized inference are also supported.
Please make sure you have installed [fairseq](https://github.com/facebookresearch/fairseq) for speech generation, then try the following command for speech understanding and generation inference:
```bash
# image-file:
# speech-file:
# generate speech: