# qwen2-audio
**Repository Path**: mirrors/qwen2-audio
## Basic Information
- **Project Name**: qwen2-audio
- **Description**: Qwen2-Audio is a large-scale audio-language model that accepts a wide range of audio signal inputs and, following speech instructions, performs audio analysis or responds directly with text
- **Primary Language**: Python
- **License**: Not specified
- **Default Branch**: main
- **Homepage**: https://www.oschina.net/p/qwen2-audio
- **GVP Project**: No
## Statistics
- **Stars**: 2
- **Forks**: 1
- **Created**: 2024-08-13
- **Last Updated**: 2025-10-18
## Categories & Tags
**Categories**: Uncategorized
**Tags**: None
## README
We introduce the latest progress of Qwen-Audio: Qwen2-Audio. As a large-scale audio-language model, Qwen2-Audio accepts a wide range of audio signal inputs and, following speech instructions, performs audio analysis or responds directly with text. We provide two distinct audio interaction modes: voice chat and audio analysis.
* Voice chat: users can interact with Qwen2-Audio by voice alone, without any text input;
* Audio analysis: users can provide both audio and text instructions during the interaction to analyze the audio;
**We have open-sourced two models of the Qwen2-Audio series: Qwen2-Audio-7B and Qwen2-Audio-7B-Instruct.**
## Architecture and Training Paradigm
Overview of the three-stage training process of Qwen2-Audio.
## News
* 2024.8.9 🎉 We released the checkpoints of `Qwen2-Audio-7B` and `Qwen2-Audio-7B-Instruct` on ModelScope and Hugging Face.
* 2024.7.15 🎉 We released the [paper](https://arxiv.org/abs/2407.10759) of Qwen2-Audio, introducing the model architecture, training method, and performance.
* 2023.11.30 🔥 We released the **Qwen-Audio** series.
## Evaluation
We evaluated the model's capabilities on 13 standard academic benchmarks, as summarized below:
| Task | Description | Dataset | Split | Metric |
|---|---|---|---|---|
| ASR | Automatic Speech Recognition | Fleurs | dev \| test | WER |
| | | Aishell2 | test | |
| | | Librispeech | dev \| test | |
| | | Common Voice | dev \| test | |
| S2TT | Speech-to-Text Translation | CoVoST2 | test | BLEU |
| SER | Speech Emotion Recognition | Meld | test | ACC |
| VSC | Vocal Sound Classification | VocalSound | test | ACC |
| AIR-Bench | Chat-Benchmark-Speech | Fisher, SpokenWOZ, IEMOCAP, Common Voice | dev \| test | GPT-4 Eval |
| | Chat-Benchmark-Sound | Clotho | dev \| test | GPT-4 Eval |
| | Chat-Benchmark-Music | MusicCaps | dev \| test | GPT-4 Eval |
| | Chat-Benchmark-Mixed-Audio | Common Voice, AudioCaps, MusicCaps | dev \| test | GPT-4 Eval |
The overall performance and detailed evaluation scores are as follows.
(Note: the results shown here were obtained with the initial model under the original training framework. After conversion to the Hugging Face framework, some metrics fluctuated slightly, so we report both sets of results, starting with the initial-model results reported in the paper.)
| Task | Dataset | Model | Metric | Results |
|---|---|---|---|---|
| ASR | Librispeech (dev-clean \| dev-other \| test-clean \| test-other) | SpeechT5 | WER | 2.1 \| 5.5 \| 2.4 \| 5.8 |
| | | SpeechNet | | - \| - \| 30.7 \| - |
| | | SLM-FT | | - \| - \| 2.6 \| 5.0 |
| | | SALMONN | | - \| - \| 2.1 \| 4.9 |
| | | SpeechVerse | | - \| - \| 2.1 \| 4.4 |
| | | Qwen-Audio | | 1.8 \| 4.0 \| 2.0 \| 4.2 |
| | | Qwen2-Audio | | 1.3 \| 3.4 \| 1.6 \| 3.6 |
| | Common Voice 15 (en \| zh \| yue \| fr) | Whisper-large-v3 | WER | 9.3 \| 12.8 \| 10.9 \| 10.8 |
| | | Qwen2-Audio | | 8.6 \| 6.9 \| 5.9 \| 9.6 |
| | Fleurs (zh) | Whisper-large-v3 | WER | 7.7 |
| | | Qwen2-Audio | | 7.5 |
| | Aishell2 (Mic \| iOS \| Android) | MMSpeech-base | WER | 4.5 \| 3.9 \| 4.0 |
| | | Paraformer-large | | - \| 2.9 \| - |
| | | Qwen-Audio | | 3.3 \| 3.1 \| 3.3 |
| | | Qwen2-Audio | | 3.0 \| 3.0 \| 2.9 |
| S2TT | CoVoST2 (en-de \| de-en \| en-zh \| zh-en) | SALMONN | BLEU | 18.6 \| - \| 33.1 \| - |
| | | SpeechLLaMA | | - \| 27.1 \| - \| 12.3 |
| | | BLSP | | 14.1 \| - \| - \| - |
| | | Qwen-Audio | | 25.1 \| 33.9 \| 41.5 \| 15.7 |
| | | Qwen2-Audio | | 29.9 \| 35.2 \| 45.2 \| 24.4 |
| | CoVoST2 (es-en \| fr-en \| it-en) | SpeechLLaMA | BLEU | 27.9 \| 25.2 \| 25.9 |
| | | Qwen-Audio | | 39.7 \| 38.5 \| 36.0 |
| | | Qwen2-Audio | | 40.0 \| 38.5 \| 36.3 |
| SER | Meld | WavLM-large | ACC | 0.542 |
| | | Qwen-Audio | | 0.557 |
| | | Qwen2-Audio | | 0.553 |
| VSC | VocalSound | CLAP | ACC | 0.4945 |
| | | Pengi | | 0.6035 |
| | | Qwen-Audio | | 0.9289 |
| | | Qwen2-Audio | | 0.9392 |
| AIR-Bench | Chat Benchmark (Speech \| Sound \| Music \| Mixed-Audio) | SALMONN | GPT-4 | 6.16 \| 6.28 \| 5.95 \| 6.08 |
| | | BLSP | | 6.17 \| 5.55 \| 5.08 \| 5.33 |
| | | Pandagpt | | 3.58 \| 5.46 \| 5.06 \| 4.25 |
| | | Macaw-LLM | | 0.97 \| 1.01 \| 0.91 \| 1.01 |
| | | SpeechGPT | | 1.57 \| 0.95 \| 0.95 \| 4.13 |
| | | Next-gpt | | 3.86 \| 4.76 \| 4.18 \| 4.13 |
| | | Qwen-Audio | | 6.47 \| 6.95 \| 5.52 \| 6.08 |
| | | Gemini-1.5-pro | | 6.97 \| 5.49 \| 5.06 \| 5.27 |
| | | Qwen2-Audio | | 7.18 \| 6.99 \| 6.79 \| 6.77 |
(The results after conversion to Hugging Face follow:)
| Task | Dataset | Model | Metric | Results |
|---|---|---|---|---|
| ASR | Librispeech (dev-clean \| dev-other \| test-clean \| test-other) | SpeechT5 | WER | 2.1 \| 5.5 \| 2.4 \| 5.8 |
| | | SpeechNet | | - \| - \| 30.7 \| - |
| | | SLM-FT | | - \| - \| 2.6 \| 5.0 |
| | | SALMONN | | - \| - \| 2.1 \| 4.9 |
| | | SpeechVerse | | - \| - \| 2.1 \| 4.4 |
| | | Qwen-Audio | | 1.8 \| 4.0 \| 2.0 \| 4.2 |
| | | Qwen2-Audio | | 1.7 \| 3.6 \| 1.7 \| 4.0 |
| | Common Voice 15 (en \| zh \| yue \| fr) | Whisper-large-v3 | WER | 9.3 \| 12.8 \| 10.9 \| 10.8 |
| | | Qwen2-Audio | | 8.7 \| 6.5 \| 5.9 \| 9.6 |
| | Fleurs (zh) | Whisper-large-v3 | WER | 7.7 |
| | | Qwen2-Audio | | 7.0 |
| | Aishell2 (Mic \| iOS \| Android) | MMSpeech-base | WER | 4.5 \| 3.9 \| 4.0 |
| | | Paraformer-large | | - \| 2.9 \| - |
| | | Qwen-Audio | | 3.3 \| 3.1 \| 3.3 |
| | | Qwen2-Audio | | 3.2 \| 3.1 \| 2.9 |
| S2TT | CoVoST2 (en-de \| de-en \| en-zh \| zh-en) | SALMONN | BLEU | 18.6 \| - \| 33.1 \| - |
| | | SpeechLLaMA | | - \| 27.1 \| - \| 12.3 |
| | | BLSP | | 14.1 \| - \| - \| - |
| | | Qwen-Audio | | 25.1 \| 33.9 \| 41.5 \| 15.7 |
| | | Qwen2-Audio | | 29.6 \| 33.6 \| 45.6 \| 24.0 |
| | CoVoST2 (es-en \| fr-en \| it-en) | SpeechLLaMA | BLEU | 27.9 \| 25.2 \| 25.9 |
| | | Qwen-Audio | | 39.7 \| 38.5 \| 36.0 |
| | | Qwen2-Audio | | 38.7 \| 37.2 \| 35.2 |
| SER | Meld | WavLM-large | ACC | 0.542 |
| | | Qwen-Audio | | 0.557 |
| | | Qwen2-Audio | | 0.535 |
| VSC | VocalSound | CLAP | ACC | 0.4945 |
| | | Pengi | | 0.6035 |
| | | Qwen-Audio | | 0.9289 |
| | | Qwen2-Audio | | 0.9395 |
| AIR-Bench | Chat Benchmark (Speech \| Sound \| Music \| Mixed-Audio) | SALMONN | GPT-4 | 6.16 \| 6.28 \| 5.95 \| 6.08 |
| | | BLSP | | 6.17 \| 5.55 \| 5.08 \| 5.33 |
| | | Pandagpt | | 3.58 \| 5.46 \| 5.06 \| 4.25 |
| | | Macaw-LLM | | 0.97 \| 1.01 \| 0.91 \| 1.01 |
| | | SpeechGPT | | 1.57 \| 0.95 \| 0.95 \| 4.13 |
| | | Next-gpt | | 3.86 \| 4.76 \| 4.18 \| 4.13 |
| | | Qwen-Audio | | 6.47 \| 6.95 \| 5.52 \| 6.08 |
| | | Gemini-1.5-pro | | 6.97 \| 5.49 \| 5.06 \| 5.27 |
| | | Qwen2-Audio | | 7.24 \| 6.83 \| 6.73 \| 6.42 |
We have provided **all** the evaluation scripts above so you can reproduce our results. Please refer to [eval_audio/EVALUATION.md](eval_audio/EVALUATION.md) for more information.
## Requirements
The code of Qwen2-Audio has been merged into the main branch of Hugging Face Transformers. We advise you to build from source with the command `pip install git+https://github.com/huggingface/transformers`; otherwise you might encounter the following error:
```
KeyError: 'qwen2-audio'
```
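To check whether your installed Transformers build already includes Qwen2-Audio support, a quick sanity check such as the following can be used (an illustrative snippet, not part of this repository):
```python
# Sanity check (illustrative): these imports only succeed on a Transformers build
# that ships the Qwen2-Audio integration.
import transformers
from transformers import AutoProcessor, Qwen2AudioForConditionalGeneration

print("transformers version:", transformers.__version__)
print("Qwen2-Audio support is available.")
```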
## Quickstart
We provide simple examples to show how to use Qwen2-Audio-7B and Qwen2-Audio-7B-Instruct with 🤗 Transformers.
Before you start, make sure you have set up your environment and installed the required packages. Most importantly, make sure you meet the requirements above, then install the dependent libraries.
You can then use the model with either Transformers or ModelScope. Currently, Qwen2-Audio-7B and Qwen2-Audio-7B-Instruct perform better on audio clips shorter than 30 seconds; a sketch for clipping longer inputs follows below.
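Because longer inputs may degrade quality, you may want to clip audio before passing it to the model. The snippet below is a minimal sketch of one way to do this with librosa (the file path `example.wav` is a placeholder): it loads at most the first 30 seconds of a clip at the sampling rate expected by the processor.
```python
import librosa
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("Qwen/Qwen2-Audio-7B-Instruct")

# Keep at most the first 30 seconds, resampled to the feature extractor's rate.
# "example.wav" is a placeholder path for a local audio file.
audio, sr = librosa.load(
    "example.wav",
    sr=processor.feature_extractor.sampling_rate,
    duration=30.0,
)
print(f"Loaded {len(audio) / sr:.1f} s of audio at {sr} Hz")
```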
#### 🤗 Hugging Face Transformers
To use Qwen2-Audio-7B-Instruct for inference, we demonstrate the voice chat and audio analysis interaction modes below; only a few lines of code are needed, as shown.
##### Voice Chat Inference
In voice chat mode, users can interact with Qwen2-Audio by voice alone, without text input:
```python
from io import BytesIO
from urllib.request import urlopen
import librosa
from transformers import Qwen2AudioForConditionalGeneration, AutoProcessor
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-Audio-7B-Instruct")
model = Qwen2AudioForConditionalGeneration.from_pretrained("Qwen/Qwen2-Audio-7B-Instruct", device_map="auto")
conversation = [
    {"role": "user", "content": [
        {"type": "audio", "audio_url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-Audio/audio/guess_age_gender.wav"},
    ]},
    {"role": "assistant", "content": "Yes, the speaker is female and in her twenties."},
    {"role": "user", "content": [
        {"type": "audio", "audio_url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-Audio/audio/translate_to_chinese.wav"},
    ]},
]
text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
# Load every audio clip referenced in the conversation at the processor's sampling rate
audios = []
for message in conversation:
    if isinstance(message["content"], list):
        for ele in message["content"]:
            if ele["type"] == "audio":
                audios.append(librosa.load(
                    BytesIO(urlopen(ele['audio_url']).read()),
                    sr=processor.feature_extractor.sampling_rate)[0]
                )
inputs = processor(text=text, audios=audios, return_tensors="pt", padding=True)
inputs.input_ids = inputs.input_ids.to("cuda")
generate_ids = model.generate(**inputs, max_length=256)
generate_ids = generate_ids[:, inputs.input_ids.size(1):]
response = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
```
##### Audio Analysis Inference
In audio analysis mode, users can provide both audio and a text question to analyze the audio:
```python
from io import BytesIO
from urllib.request import urlopen
import librosa
from transformers import Qwen2AudioForConditionalGeneration, AutoProcessor
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-Audio-7B-Instruct")
model = Qwen2AudioForConditionalGeneration.from_pretrained("Qwen/Qwen2-Audio-7B-Instruct", device_map="auto")
conversation = [
    {'role': 'system', 'content': 'You are a helpful assistant.'},
    {"role": "user", "content": [
        {"type": "audio", "audio_url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-Audio/audio/glass-breaking-151256.mp3"},
        {"type": "text", "text": "What's that sound?"},
    ]},
    {"role": "assistant", "content": "It is the sound of glass shattering."},
    {"role": "user", "content": [
        {"type": "text", "text": "What can you do when you hear that?"},
    ]},
    {"role": "assistant", "content": "Stay alert and cautious, and check if anyone is hurt or if there is any damage to property."},
    {"role": "user", "content": [
        {"type": "audio", "audio_url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-Audio/audio/1272-128104-0000.flac"},
        {"type": "text", "text": "What does the person say?"},
    ]},
]
text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
# Load every audio clip referenced in the conversation at the processor's sampling rate
audios = []
for message in conversation:
    if isinstance(message["content"], list):
        for ele in message["content"]:
            if ele["type"] == "audio":
                audios.append(
                    librosa.load(
                        BytesIO(urlopen(ele['audio_url']).read()),
                        sr=processor.feature_extractor.sampling_rate)[0]
                )
inputs = processor(text=text, audios=audios, return_tensors="pt", padding=True)
inputs.input_ids = inputs.input_ids.to("cuda")
generate_ids = model.generate(**inputs, max_length=256)
generate_ids = generate_ids[:, inputs.input_ids.size(1):]
response = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
```
##### Batch Inference
Batch inference is also supported:
```python
from io import BytesIO
from urllib.request import urlopen
import librosa
from transformers import Qwen2AudioForConditionalGeneration, AutoProcessor
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-Audio-7B-Instruct")
model = Qwen2AudioForConditionalGeneration.from_pretrained("Qwen/Qwen2-Audio-7B-Instruct", device_map="auto")
conversation1 = [
    {"role": "user", "content": [
        {"type": "audio", "audio_url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-Audio/audio/glass-breaking-151256.mp3"},
        {"type": "text", "text": "What's that sound?"},
    ]},
    {"role": "assistant", "content": "It is the sound of glass shattering."},
    {"role": "user", "content": [
        {"type": "audio", "audio_url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-Audio/audio/f2641_0_throatclearing.wav"},
        {"type": "text", "text": "What can you hear?"},
    ]}
]
conversation2 = [
    {"role": "user", "content": [
        {"type": "audio", "audio_url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-Audio/audio/1272-128104-0000.flac"},
        {"type": "text", "text": "What does the person say?"},
    ]},
]
conversations = [conversation1, conversation2]
text = [processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False) for conversation in conversations]
# Load every audio clip referenced across all conversations at the processor's sampling rate
audios = []
for conversation in conversations:
    for message in conversation:
        if isinstance(message["content"], list):
            for ele in message["content"]:
                if ele["type"] == "audio":
                    audios.append(
                        librosa.load(
                            BytesIO(urlopen(ele['audio_url']).read()),
                            sr=processor.feature_extractor.sampling_rate)[0]
                    )
inputs = processor(text=text, audios=audios, return_tensors="pt", padding=True)
inputs.input_ids = inputs.input_ids.to("cuda")
generate_ids = model.generate(**inputs, max_length=256)
generate_ids = generate_ids[:, inputs.input_ids.size(1):]
response = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)
```
Running the pretrained Qwen2-Audio-7B base model is just as simple.
```python
from io import BytesIO
from urllib.request import urlopen
import librosa
from transformers import AutoProcessor, Qwen2AudioForConditionalGeneration
model = Qwen2AudioForConditionalGeneration.from_pretrained("Qwen/Qwen2-Audio-7B", trust_remote_code=True)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-Audio-7B", trust_remote_code=True)
prompt = "<|audio_bos|><|AUDIO|><|audio_eos|>Generate the caption in English:"
url = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-Audio/glass-breaking-151256.mp3"
audio, sr = librosa.load(BytesIO(urlopen(url).read()), sr=processor.feature_extractor.sampling_rate)
inputs = processor(text=prompt, audios=audio, return_tensors="pt")
generated_ids = model.generate(**inputs, max_length=256)
generated_ids = generated_ids[:, inputs.input_ids.size(1):]
response = processor.batch_decode(generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
```
#### 🤖 ModelScope
We strongly advise users, especially those in mainland China, to use ModelScope; `snapshot_download` can help you work around issues when downloading checkpoints, as sketched below.
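As a minimal sketch (assuming the `modelscope` package is installed and that the model id on ModelScope is `qwen/Qwen2-Audio-7B-Instruct`), you could download the checkpoint locally and then load it with Transformers like this:
```python
from modelscope import snapshot_download
from transformers import AutoProcessor, Qwen2AudioForConditionalGeneration

# Download the checkpoint from ModelScope to a local cache directory
# (the model id below is an assumption; check the ModelScope model page).
local_dir = snapshot_download("qwen/Qwen2-Audio-7B-Instruct")

# Load the downloaded checkpoint with Transformers as usual.
processor = AutoProcessor.from_pretrained(local_dir)
model = Qwen2AudioForConditionalGeneration.from_pretrained(local_dir, device_map="auto")
```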
## Demo
### Web UI
We provide a Web UI demo. Before you start, make sure you have installed the following dependencies:
```
pip install -r requirements_web_demo.txt
```
Then run the following command and click on the generated link:
```
python demo/web_demo_audio.py
```
## Showcase
More cases will be posted on the Qwen2-Audio page of the [Qwen blog](https://qwenlm.github.io/blog/qwen2-audio).
## We Are Hiring
We are the speech multimodal team of Tongyi Qianwen (Qwen), dedicated to extending Qwen with audio multimodal understanding and generation capabilities to enable free and flexible audio interaction. The team is growing rapidly; if you are interested in joining us as an intern or full-time employee, please send your resume to `qwen_audio@list.alibaba-inc.com`.
## License Agreement
Please check the license of each model in its Hugging Face repository. You do not need to submit a request for commercial use.
## Citation
If you find our paper and code helpful for your research, please consider giving a :star: and citing :pencil: :)
```BibTeX
@article{Qwen-Audio,
title={Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models},
author={Chu, Yunfei and Xu, Jin and Zhou, Xiaohuan and Yang, Qian and Zhang, Shiliang and Yan, Zhijie and Zhou, Chang and Zhou, Jingren},
journal={arXiv preprint arXiv:2311.07919},
year={2023}
}
```
```BibTeX
@article{Qwen2-Audio,
title={Qwen2-Audio Technical Report},
author={Chu, Yunfei and Xu, Jin and Yang, Qian and Wei, Haojie and Wei, Xipin and Guo, Zhifang and Leng, Yichong and Lv, Yuanjun and He, Jinzheng and Lin, Junyang and Zhou, Chang and Zhou, Jingren},
journal={arXiv preprint arXiv:2407.10759},
year={2024}
}
```
## Contact Us
If you would like to leave a message for our research or product teams, feel free to email us at `qianwen_opensource@alibabacloud.com`.