# qwen2-audio
**Repository Path**: mirrors/qwen2-audio
## Basic Information
- **Project Name**: qwen2-audio
- **Description**: Qwen2-Audio is a large-scale audio-language model that accepts a wide range of audio signal inputs and, following speech instructions, performs audio analysis or responds directly with text
- **Primary Language**: Python
- **License**: Not specified
- **Default Branch**: main
- **Homepage**: https://www.oschina.net/p/qwen2-audio
- **GVP Project**: No
## Statistics
- **Stars**: 2
- **Forks**: 1
- **Created**: 2024-08-13
- **Last Updated**: 2025-10-18
## Categories & Tags
**Categories**: Uncategorized
**Tags**: None
## README
We introduce the latest progress of Qwen-Audio: Qwen2-Audio. As a large-scale audio-language model, Qwen2-Audio accepts a wide range of audio signal inputs and, following speech instructions, performs audio analysis or responds directly with text. We provide two distinct audio interaction modes: voice chat and audio analysis.
* Voice chat: users can interact with Qwen2-Audio by voice alone, without any text input;
* Audio analysis: users can provide both audio and text instructions during the interaction to analyze the audio;
**We have open-sourced two models of the Qwen2-Audio series: Qwen2-Audio-7B and Qwen2-Audio-7B-Instruct.**
## Architecture and Training Paradigm
Overview of the three-stage training process of Qwen2-Audio.
## News
* 2024.8.9 🎉 We released the checkpoints of `Qwen2-Audio-7B` and `Qwen2-Audio-7B-Instruct` on ModelScope and Hugging Face.
* 2024.7.15 🎉 We released the [paper](https://arxiv.org/abs/2407.10759) of Qwen2-Audio, introducing the model architecture, training method, and performance.
* 2023.11.30 🔥 We released the **Qwen-Audio** series.
## Evaluation
We evaluated the model's capabilities on 13 standard academic benchmarks, as summarized below:
| Task | Description | Dataset | Split | Metric |
|---|---|---|---|---|
| ASR | Automatic Speech Recognition | Fleurs | dev \| test | WER |
| | | Aishell2 | test | |
| | | Librispeech | dev \| test | |
| | | Common Voice | dev \| test | |
| S2TT | Speech-to-Text Translation | CoVoST2 | test | BLEU |
| SER | Speech Emotion Recognition | Meld | test | ACC |
| VSC | Vocal Sound Classification | VocalSound | test | ACC |
| AIR-Bench | Chat-Benchmark-Speech | Fisher, SpokenWOZ, IEMOCAP, Common Voice | dev \| test | GPT-4 Eval |
| | Chat-Benchmark-Sound | Clotho | dev \| test | GPT-4 Eval |
| | Chat-Benchmark-Music | MusicCaps | dev \| test | GPT-4 Eval |
| | Chat-Benchmark-Mixed-Audio | Common Voice, AudioCaps, MusicCaps | dev \| test | GPT-4 Eval |
The overall performance and detailed evaluation scores are as follows.
(Note: the results shown here were obtained with the initial model under the original training framework. After conversion to the Hugging Face framework, some metrics fluctuated slightly, so we report both sets of results, starting with the initial-model results reported in the paper.)
| Task | Dataset | Model | Metric | Results |
|---|---|---|---|---|
| ASR | Librispeech (dev-clean \| dev-other \| test-clean \| test-other) | SpeechT5 | WER | 2.1 \| 5.5 \| 2.4 \| 5.8 |
| | | SpeechNet | | - \| - \| 30.7 \| - |
| | | SLM-FT | | - \| - \| 2.6 \| 5.0 |
| | | SALMONN | | - \| - \| 2.1 \| 4.9 |
| | | SpeechVerse | | - \| - \| 2.1 \| 4.4 |
| | | Qwen-Audio | | 1.8 \| 4.0 \| 2.0 \| 4.2 |
| | | Qwen2-Audio | | 1.3 \| 3.4 \| 1.6 \| 3.6 |
| | Common Voice 15 (en \| zh \| yue \| fr) | Whisper-large-v3 | WER | 9.3 \| 12.8 \| 10.9 \| 10.8 |
| | | Qwen2-Audio | | 8.6 \| 6.9 \| 5.9 \| 9.6 |
| | Fleurs (zh) | Whisper-large-v3 | WER | 7.7 |
| | | Qwen2-Audio | | 7.5 |
| | Aishell2 (Mic \| iOS \| Android) | MMSpeech-base | WER | 4.5 \| 3.9 \| 4.0 |
| | | Paraformer-large | | - \| 2.9 \| - |
| | | Qwen-Audio | | 3.3 \| 3.1 \| 3.3 |
| | | Qwen2-Audio | | 3.0 \| 3.0 \| 2.9 |
| S2TT | CoVoST2 (en-de \| de-en \| en-zh \| zh-en) | SALMONN | BLEU | 18.6 \| - \| 33.1 \| - |
| | | SpeechLLaMA | | - \| 27.1 \| - \| 12.3 |
| | | BLSP | | 14.1 \| - \| - \| - |
| | | Qwen-Audio | | 25.1 \| 33.9 \| 41.5 \| 15.7 |
| | | Qwen2-Audio | | 29.9 \| 35.2 \| 45.2 \| 24.4 |
| | CoVoST2 (es-en \| fr-en \| it-en) | SpeechLLaMA | BLEU | 27.9 \| 25.2 \| 25.9 |
| | | Qwen-Audio | | 39.7 \| 38.5 \| 36.0 |
| | | Qwen2-Audio | | 40.0 \| 38.5 \| 36.3 |
| SER | Meld | WavLM-large | ACC | 0.542 |
| | | Qwen-Audio | | 0.557 |
| | | Qwen2-Audio | | 0.553 |
| VSC | VocalSound | CLAP | ACC | 0.4945 |
| | | Pengi | | 0.6035 |
| | | Qwen-Audio | | 0.9289 |
| | | Qwen2-Audio | | 0.9392 |
| AIR-Bench | Chat Benchmark (Speech \| Sound \| Music \| Mixed-Audio) | SALMONN | GPT-4 | 6.16 \| 6.28 \| 5.95 \| 6.08 |
| | | BLSP | | 6.17 \| 5.55 \| 5.08 \| 5.33 |
| | | Pandagpt | | 3.58 \| 5.46 \| 5.06 \| 4.25 |
| | | Macaw-LLM | | 0.97 \| 1.01 \| 0.91 \| 1.01 |
| | | SpeechGPT | | 1.57 \| 0.95 \| 0.95 \| 4.13 |
| | | Next-gpt | | 3.86 \| 4.76 \| 4.18 \| 4.13 |
| | | Qwen-Audio | | 6.47 \| 6.95 \| 5.52 \| 6.08 |
| | | Gemini-1.5-pro | | 6.97 \| 5.49 \| 5.06 \| 5.27 |
| | | Qwen2-Audio | | 7.18 \| 6.99 \| 6.79 \| 6.77 |
(The results after conversion to Hugging Face follow:)
| Task | Dataset | Model | Metric | Results |
|---|---|---|---|---|
| ASR | Librispeech (dev-clean \| dev-other \| test-clean \| test-other) | SpeechT5 | WER | 2.1 \| 5.5 \| 2.4 \| 5.8 |
| | | SpeechNet | | - \| - \| 30.7 \| - |
| | | SLM-FT | | - \| - \| 2.6 \| 5.0 |
| | | SALMONN | | - \| - \| 2.1 \| 4.9 |
| | | SpeechVerse | | - \| - \| 2.1 \| 4.4 |
| | | Qwen-Audio | | 1.8 \| 4.0 \| 2.0 \| 4.2 |
| | | Qwen2-Audio | | 1.7 \| 3.6 \| 1.7 \| 4.0 |
| | Common Voice 15 (en \| zh \| yue \| fr) | Whisper-large-v3 | WER | 9.3 \| 12.8 \| 10.9 \| 10.8 |
| | | Qwen2-Audio | | 8.7 \| 6.5 \| 5.9 \| 9.6 |
| | Fleurs (zh) | Whisper-large-v3 | WER | 7.7 |
| | | Qwen2-Audio | | 7.0 |
| | Aishell2 (Mic \| iOS \| Android) | MMSpeech-base | WER | 4.5 \| 3.9 \| 4.0 |
| | | Paraformer-large | | - \| 2.9 \| - |
| | | Qwen-Audio | | 3.3 \| 3.1 \| 3.3 |
| | | Qwen2-Audio | | 3.2 \| 3.1 \| 2.9 |
| S2TT | CoVoST2 (en-de \| de-en \| en-zh \| zh-en) | SALMONN | BLEU | 18.6 \| - \| 33.1 \| - |
| | | SpeechLLaMA | | - \| 27.1 \| - \| 12.3 |
| | | BLSP | | 14.1 \| - \| - \| - |
| | | Qwen-Audio | | 25.1 \| 33.9 \| 41.5 \| 15.7 |
| | | Qwen2-Audio | | 29.6 \| 33.6 \| 45.6 \| 24.0 |
| | CoVoST2 (es-en \| fr-en \| it-en) | SpeechLLaMA | BLEU | 27.9 \| 25.2 \| 25.9 |
| | | Qwen-Audio | | 39.7 \| 38.5 \| 36.0 |
| | | Qwen2-Audio | | 38.7 \| 37.2 \| 35.2 |
| SER | Meld | WavLM-large | ACC | 0.542 |
| | | Qwen-Audio | | 0.557 |
| | | Qwen2-Audio | | 0.535 |
| VSC | VocalSound | CLAP | ACC | 0.4945 |
| | | Pengi | | 0.6035 |
| | | Qwen-Audio | | 0.9289 |
| | | Qwen2-Audio | | 0.9395 |
| AIR-Bench | Chat Benchmark (Speech \| Sound \| Music \| Mixed-Audio) | SALMONN | GPT-4 | 6.16 \| 6.28 \| 5.95 \| 6.08 |
| | | BLSP | | 6.17 \| 5.55 \| 5.08 \| 5.33 |
| | | Pandagpt | | 3.58 \| 5.46 \| 5.06 \| 4.25 |
| | | Macaw-LLM | | 0.97 \| 1.01 \| 0.91 \| 1.01 |
| | | SpeechGPT | | 1.57 \| 0.95 \| 0.95 \| 4.13 |
| | | Next-gpt | | 3.86 \| 4.76 \| 4.18 \| 4.13 |
| | | Qwen-Audio | | 6.47 \| 6.95 \| 5.52 \| 6.08 |
| | | Gemini-1.5-pro | | 6.97 \| 5.49 \| 5.06 \| 5.27 |
| | | Qwen2-Audio | | 7.24 \| 6.83 \| 6.73 \| 6.42 |
We have provided **all** the evaluation scripts above so you can reproduce our results. Please refer to [eval_audio/EVALUATION.md](eval_audio/EVALUATION.md) for more information.
## Requirements
The code of Qwen2-Audio has been merged into the main branch of Hugging Face Transformers. We advise you to build from source with the command `pip install git+https://github.com/huggingface/transformers`; otherwise you might encounter the following error:
```
KeyError: 'qwen2-audio'
```
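To check whether your installed Transformers build already includes Qwen2-Audio support, a quick sanity check such as the following can be used (an illustrative snippet, not part of this repository):
```python
# Sanity check (illustrative): these imports only succeed on a Transformers build
# that ships the Qwen2-Audio integration.
import transformers
from transformers import AutoProcessor, Qwen2AudioForConditionalGeneration

print("transformers version:", transformers.__version__)
print("Qwen2-Audio support is available.")
```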
## Quickstart
We provide simple examples to show how to use Qwen2-Audio-7B and Qwen2-Audio-7B-Instruct with 🤗 Transformers.
Before you start, make sure you have set up your environment and installed the required packages. Most importantly, make sure you meet the requirements above, then install the dependent libraries.
You can then use the model with either Transformers or ModelScope. Currently, Qwen2-Audio-7B and Qwen2-Audio-7B-Instruct perform better on audio clips shorter than 30 seconds; a sketch for clipping longer inputs follows below.
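Because longer inputs may degrade quality, you may want to clip audio before passing it to the model. The snippet below is a minimal sketch of one way to do this with librosa (the file path `example.wav` is a placeholder): it loads at most the first 30 seconds of a clip at the sampling rate expected by the processor.
```python
import librosa
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("Qwen/Qwen2-Audio-7B-Instruct")

# Keep at most the first 30 seconds, resampled to the feature extractor's rate.
# "example.wav" is a placeholder path for a local audio file.
audio, sr = librosa.load(
    "example.wav",
    sr=processor.feature_extractor.sampling_rate,
    duration=30.0,
)
print(f"Loaded {len(audio) / sr:.1f} s of audio at {sr} Hz")
```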
#### 🤗 Hugging Face Transformers
To use Qwen2-Audio-7B-Instruct for inference, we demonstrate the voice chat and audio analysis interaction modes below; only a few lines of code are needed, as shown.
##### Voice Chat Inference
In voice chat mode, users can interact with Qwen2-Audio by voice alone, without text input:
```python
from io import BytesIO
from urllib.request import urlopen
import librosa
from transformers import Qwen2AudioForConditionalGeneration, AutoProcessor
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-Audio-7B-Instruct")
model = Qwen2AudioForConditionalGeneration.from_pretrained("Qwen/Qwen2-Audio-7B-Instruct", device_map="auto")
conversation = [
    {"role": "user", "content": [
        {"type": "audio", "audio_url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-Audio/audio/guess_age_gender.wav"},
    ]},
    {"role": "assistant", "content": "Yes, the speaker is female and in her twenties."},
    {"role": "user", "content": [
        {"type": "audio", "audio_url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-Audio/audio/translate_to_chinese.wav"},
    ]},
]
text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
# Load every audio clip referenced in the conversation at the processor's sampling rate
audios = []
for message in conversation:
    if isinstance(message["content"], list):
        for ele in message["content"]:
            if ele["type"] == "audio":
                audios.append(librosa.load(
                    BytesIO(urlopen(ele['audio_url']).read()),
                    sr=processor.feature_extractor.sampling_rate)[0]
                )
inputs = processor(text=text, audios=audios, return_tensors="pt", padding=True)
inputs.input_ids = inputs.input_ids.to("cuda")
generate_ids = model.generate(**inputs, max_length=256)
generate_ids = generate_ids[:, inputs.input_ids.size(1):]
response = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
```
##### Audio Analysis Inference
In audio analysis mode, users can provide both audio and a text question to analyze the audio:
```python
from io import BytesIO
from urllib.request import urlopen
import librosa
from transformers import Qwen2AudioForConditionalGeneration, AutoProcessor
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-Audio-7B-Instruct")
model = Qwen2AudioForConditionalGeneration.from_pretrained("Qwen/Qwen2-Audio-7B-Instruct", device_map="auto")
conversation = [
    {'role': 'system', 'content': 'You are a helpful assistant.'},
    {"role": "user", "content": [
        {"type": "audio", "audio_url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-Audio/audio/glass-breaking-151256.mp3"},
        {"type": "text", "text": "What's that sound?"},
    ]},
    {"role": "assistant", "content": "It is the sound of glass shattering."},
    {"role": "user", "content": [
        {"type": "text", "text": "What can you do when you hear that?"},
    ]},
    {"role": "assistant", "content": "Stay alert and cautious, and check if anyone is hurt or if there is any damage to property."},
    {"role": "user", "content": [
        {"type": "audio", "audio_url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-Audio/audio/1272-128104-0000.flac"},
        {"type": "text", "text": "What does the person say?"},
    ]},
]
text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
# Load every audio clip referenced in the conversation at the processor's sampling rate
audios = []
for message in conversation:
    if isinstance(message["content"], list):
        for ele in message["content"]:
            if ele["type"] == "audio":
                audios.append(
                    librosa.load(
                        BytesIO(urlopen(ele['audio_url']).read()),
                        sr=processor.feature_extractor.sampling_rate)[0]
                )
inputs = processor(text=text, audios=audios, return_tensors="pt", padding=True)
inputs.input_ids = inputs.input_ids.to("cuda")
generate_ids = model.generate(**inputs, max_length=256)
generate_ids = generate_ids[:, inputs.input_ids.size(1):]
response = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
```
##### Batch Inference
Batch inference is also supported:
```python
from io import BytesIO
from urllib.request import urlopen
import librosa
from transformers import Qwen2AudioForConditionalGeneration, AutoProcessor
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-Audio-7B-Instruct")
model = Qwen2AudioForConditionalGeneration.from_pretrained("Qwen/Qwen2-Audio-7B-Instruct", device_map="auto")
conversation1 = [
    {"role": "user", "content": [
        {"type": "audio", "audio_url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-Audio/audio/glass-breaking-151256.mp3"},
        {"type": "text", "text": "What's that sound?"},
    ]},
    {"role": "assistant", "content": "It is the sound of glass shattering."},
    {"role": "user", "content": [
        {"type": "audio", "audio_url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-Audio/audio/f2641_0_throatclearing.wav"},
        {"type": "text", "text": "What can you hear?"},
    ]}
]
conversation2 = [
    {"role": "user", "content": [
        {"type": "audio", "audio_url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-Audio/audio/1272-128104-0000.flac"},
        {"type": "text", "text": "What does the person say?"},
    ]},
]
conversations = [conversation1, conversation2]
text = [processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False) for conversation in conversations]
# Load every audio clip referenced across all conversations at the processor's sampling rate
audios = []
for conversation in conversations:
    for message in conversation:
        if isinstance(message["content"], list):
            for ele in message["content"]:
                if ele["type"] == "audio":
                    audios.append(
                        librosa.load(
                            BytesIO(urlopen(ele['audio_url']).read()),
                            sr=processor.feature_extractor.sampling_rate)[0]
                    )
inputs = processor(text=text, audios=audios, return_tensors="pt", padding=True)
inputs.input_ids = inputs.input_ids.to("cuda")
generate_ids = model.generate(**inputs, max_length=256)
generate_ids = generate_ids[:, inputs.input_ids.size(1):]
response = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)
```
Running the pretrained Qwen2-Audio-7B base model is just as simple.
```python
from io import BytesIO
from urllib.request import urlopen
import librosa
from transformers import AutoProcessor, Qwen2AudioForConditionalGeneration
model = Qwen2AudioForConditionalGeneration.from_pretrained("Qwen/Qwen2-Audio-7B", trust_remote_code=True)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-Audio-7B", trust_remote_code=True)
prompt = "<|audio_bos|><|AUDIO|><|audio_eos|>Generate the caption in English:"
url = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-Audio/glass-breaking-151256.mp3"
audio, sr = librosa.load(BytesIO(urlopen(url).read()), sr=processor.feature_extractor.sampling_rate)
inputs = processor(text=prompt, audios=audio, return_tensors="pt")
generated_ids = model.generate(**inputs, max_length=256)
generated_ids = generated_ids[:, inputs.input_ids.size(1):]
response = processor.batch_decode(generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
```
#### 🤖 ModelScope
We strongly advise users, especially those in mainland China, to use ModelScope; `snapshot_download` can help you work around issues when downloading checkpoints, as sketched below.
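As a minimal sketch (assuming the `modelscope` package is installed and that the model id on ModelScope is `qwen/Qwen2-Audio-7B-Instruct`), you could download the checkpoint locally and then load it with Transformers like this:
```python
from modelscope import snapshot_download
from transformers import AutoProcessor, Qwen2AudioForConditionalGeneration

# Download the checkpoint from ModelScope to a local cache directory
# (the model id below is an assumption; check the ModelScope model page).
local_dir = snapshot_download("qwen/Qwen2-Audio-7B-Instruct")

# Load the downloaded checkpoint with Transformers as usual.
processor = AutoProcessor.from_pretrained(local_dir)
model = Qwen2AudioForConditionalGeneration.from_pretrained(local_dir, device_map="auto")
```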
## Demo
### Web UI
We provide a Web UI demo. Before you start, make sure you have installed the following dependencies:
```
pip install -r requirements_web_demo.txt
```
Then run the following command and click on the generated link:
```
python demo/web_demo_audio.py
```
## Showcase
More cases will be posted on the Qwen2-Audio page of the [Qwen blog](https://qwenlm.github.io/blog/qwen2-audio).
## We Are Hiring
We are the speech multimodal team of Tongyi Qianwen (Qwen), dedicated to extending Qwen with audio multimodal understanding and generation capabilities to enable free and flexible audio interaction. The team is growing rapidly; if you are interested in joining us as an intern or full-time employee, please send your resume to `qwen_audio@list.alibaba-inc.com`.
## License Agreement
Please check the license of each model in its Hugging Face repository. You do not need to submit a request for commercial use.
## Citation
If you find our paper and code helpful for your research, please consider giving a :star: and citing :pencil: :)
```BibTeX
@article{Qwen-Audio,
title={Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models},
author={Chu, Yunfei and Xu, Jin and Zhou, Xiaohuan and Yang, Qian and Zhang, Shiliang and Yan, Zhijie and Zhou, Chang and Zhou, Jingren},
journal={arXiv preprint arXiv:2311.07919},
year={2023}
}
```
```BibTeX
@article{Qwen2-Audio,
title={Qwen2-Audio Technical Report},
author={Chu, Yunfei and Xu, Jin and Yang, Qian and Wei, Haojie and Wei, Xipin and Guo, Zhifang and Leng, Yichong and Lv, Yuanjun and He, Jinzheng and Lin, Junyang and Zhou, Chang and Zhou, Jingren},
journal={arXiv preprint arXiv:2407.10759},
year={2024}
}
```
## Contact Us
If you would like to leave a message for our research or product teams, feel free to email us at `qianwen_opensource@alibabacloud.com`.