This project uses an ESP32-S3 to build an AI voice chat assistant with a customizable persona (for example, a doctor). Working through it, you will learn the basics of ESP32-S3 Arduino development: calling the Baidu speech recognition and speech synthesis APIs, using the Baidu AppBuilder API to implement a custom persona, training a custom wake word, reading and writing an SD card, driving a touch screen, and configuring Wi-Fi via SmartConfig. All hardware and software for this project is open source and comes with detailed written tutorials and matching video tutorials, so it is well suited to complete beginners. I hope it helps.
https://gitee.com/chging/esp32s3-ai-chat
[Hou Guoxiong's channel on Bilibili] https://b23.tv/AsFNSeJ
The tutorial document「ESP32-S3-AI-Chat-V2.docx」is shared via Quark Drive; open the link below to save it.
Link: https://pan.quark.cn/s/adc8c266e0a4
Purchase link: https://h5.m.taobao.com/awp/core/detail.htm?ft=t&id=833542085705
Double-click the downloaded installer and follow the steps shown below:
Wait a moment for the installation to complete.
This project requires the esp32 board package. Try the online installation first; if it fails, use the offline method instead.
https://raw.githubusercontent.com/espressif/arduino-esp32/gh-pages/package_esp32_dev_index.json
Then click OK and save.
If the download or installation keeps failing, install the package offline instead.
The offline package「esp32.rar」is shared via Quark Drive: https://pan.quark.cn/s/61d4a28219bb
C:\Users\用户名\AppData\Local\Arduino15\packages
Note: AppData is a hidden folder, so you need to enable showing hidden items in the folder view options. The user name on my machine is Administrator.
This project requires the online and offline libraries listed below.
Online libraries can be installed directly in the Arduino Library Manager by searching for their names.
Library | Version |
---|---|
ArduinoJson | 7.1.0 |
base64 | 1.3.0 |
UrlEncode | 1.0.1 |
lvgl | 8.5.10 |
TFT_eSPI | 2.5.43 |
bb_captouch | 1.2.2 |
Edit User_Setup_Select.h, located in the Arduino libraries folder.
Move the lvgl\demos folder into \lvgl\src\demos\.
The font file「lv_font_SiYuanHeiTi_Normal_16.c」is shared via Quark Drive:
Link: https://pan.quark.cn/s/3ffcf83d30aa
The UI archive「test_ui.zip」is shared via Quark Drive:
Link: https://pan.quark.cn/s/fd4657269252
Download the archive, extract it, and copy it into lvgl\src.
Arduino can also install libraries from offline archives; the main offline library this project needs is the trained wake-word library. The one below is the model I trained myself. If you have not trained your own yet, import mine first; training your own wake word is covered in detail in Chapter 6.
Library |
---|
wakeup_detect_houguoxiong_inferencing |
Before calling the Baidu APIs, you first need to apply for an API key on the Baidu AI Cloud platform and, once approved, enable the corresponding API services; only then can the APIs be accessed.
Baidu AI Cloud: https://cloud.baidu.com/
First we need to create the API key for speech recognition.
Speech synthesis uses the same API key as speech recognition, so once the key from the previous section is created it can be used directly; the synthesis service, however, must be enabled separately.
The API key for the LLM application (Agent) is likewise requested on the Baidu AI Cloud platform.
With the API services above enabled, we can test online whether activation succeeded.
Start with the speech synthesis API, since it takes plain text as input, whereas speech recognition requires audio data as input. Test synthesis first; the audio it generates can then be saved and used as the input for the speech recognition test.
There is a widely used general-purpose tool for API access, designed specifically for testing whether an API endpoint is responding correctly.
Click Sample Code -> Curl -> Copy code.
When developing the ESP32 with Arduino you rely on many libraries, so the compiled binary can easily grow beyond the space reserved for code; in that case the flash partition sizes need to be adjusted.
C:\Users\Administrator\AppData\Local\Arduino15\packages\esp32\hardware\esp32\2.0.13 (replace Administrator with your own user name)
At the location shown in the figure below, add the following entries, following the pattern of the other flash partition schemes:
esp32s3.menu.PartitionScheme.default_16MB=16M with spiffs (6MB APP/3MB SPIFFS)
esp32s3.menu.PartitionScheme.default_16MB.build.partitions=default_16MB
esp32s3.menu.PartitionScheme.default_16MB.upload.maximum_size=6553600
The following step is only needed for Arduino IDE 2.0 and later; on other versions, simply restarting the Arduino IDE is enough.
C:\Users\Administrator\AppData\Roaming\arduino-ide\Local Storage\leveldb
Delete the files at the path above and restart the Arduino IDE to pick up the new partition scheme; what is deleted here is only cached data.
With that, the flash partitioning is complete and larger programs can be compiled.
The main AI Agent role-play sketch is **esp32s3-ai-chat\ai-all\esp32s3_ai_chat_all\esp32s3_ai_chat_all.ino**. Open it, fill in the Wi-Fi and API key settings, then compile and flash it to the board for testing.
In the board selector at the top, choose the board and port: pick ESP32S3 Dev Module as the board, and for the port select the serial port that appears after connecting over USB Type-C (check Device Manager if unsure).
Say "houguoxiong" to wake the ESP32-S3 and start a conversation. Alternatively, touch the screen (hold to record, release to stop; the system then runs the recording through speech recognition, the LLM call, and the rest of the pipeline).
This module implements wake-word detection with a self-trained model. The **esp32s3-ai-chat/example/wake_detect** project implements exactly that, and the AI voice chat features are developed on top of it.
Full code:
#include <wakeup_detect_houguoxiong_inferencing.h>
/* Edge Impulse Arduino examples
* Copyright (c) 2022 EdgeImpulse Inc.
*
* Permission is hereby granted, free of charge, to any person obtaining a copy
* of this software and associated documentation files (the "Software"), to deal
* in the Software without restriction, including without limitation the rights
* to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
* copies of the Software, and to permit persons to whom the Software is
* furnished to do so, subject to the following conditions:
*
* The above copyright notice and this permission notice shall be included in
* all copies or substantial portions of the Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
* IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
* FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
* AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
* LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
* OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
* SOFTWARE.
*/
// If your target is limited in memory remove this macro to save 10K RAM
#define EIDSP_QUANTIZE_FILTERBANK 0
/*
** NOTE: If you run into TFLite arena allocation issue.
**
** This may be due to dynamic memory fragmentation.
** Try defining "-DEI_CLASSIFIER_ALLOCATION_STATIC" in boards.local.txt (create
** if it doesn't exist) and copy this file to
** `<ARDUINO_CORE_INSTALL_PATH>/arduino/hardware/<mbed_core>/<core_version>/`.
**
** See
** (https://support.arduino.cc/hc/en-us/articles/360012076960-Where-are-the-installed-cores-located-)
** to find where Arduino installs cores on your machine.
**
** If the problem persists then there's not enough memory for this model and application.
*/
/* Includes ---------------------------------------------------------------- */
#include <driver/i2s.h>
#define SAMPLE_RATE 16000U
#define LED_BUILT_IN 21
// INMP441 config
#define I2S_IN_PORT I2S_NUM_0
#define I2S_IN_BCLK 4
#define I2S_IN_LRC 5
#define I2S_IN_DIN 6
/** Audio buffers, pointers and selectors */
typedef struct {
int16_t *buffer;
uint8_t buf_ready;
uint32_t buf_count;
uint32_t n_samples;
} inference_t;
static inference_t inference;
static const uint32_t sample_buffer_size = 2048;
static signed short sampleBuffer[sample_buffer_size];
static bool debug_nn = false; // Set this to true to see e.g. features generated from the raw signal
static bool record_status = true;
/**
* @brief Arduino setup function
*/
void setup() {
// put your setup code here, to run once:
Serial.begin(115200);
// comment out the below line to cancel the wait for USB connection (needed for native USB)
while (!Serial)
;
Serial.println("Edge Impulse Inferencing Demo");
pinMode(LED_BUILT_IN, OUTPUT); // Set the pin as output
digitalWrite(LED_BUILT_IN, HIGH); //Turn off
// Initialize I2S for audio input
i2s_config_t i2s_config_in = {
.mode = (i2s_mode_t)(I2S_MODE_MASTER | I2S_MODE_RX),
.sample_rate = SAMPLE_RATE,
.bits_per_sample = I2S_BITS_PER_SAMPLE_16BIT, // Note: the INMP441 outputs 32-bit data; 16-bit capture is used here
.channel_format = I2S_CHANNEL_FMT_ONLY_LEFT,
.communication_format = i2s_comm_format_t(I2S_COMM_FORMAT_STAND_I2S),
.intr_alloc_flags = ESP_INTR_FLAG_LEVEL1,
.dma_buf_count = 8,
.dma_buf_len = 1024,
};
i2s_pin_config_t pin_config_in = {
.bck_io_num = I2S_IN_BCLK,
.ws_io_num = I2S_IN_LRC,
.data_out_num = -1,
.data_in_num = I2S_IN_DIN
};
i2s_driver_install(I2S_IN_PORT, &i2s_config_in, 0, NULL);
i2s_set_pin(I2S_IN_PORT, &pin_config_in);
// summary of inferencing settings (from model_metadata.h)
ei_printf("Inferencing settings:\n");
ei_printf("\tInterval: ");
ei_printf_float((float)EI_CLASSIFIER_INTERVAL_MS);
ei_printf(" ms.\n");
ei_printf("\tFrame size: %d\n", EI_CLASSIFIER_DSP_INPUT_FRAME_SIZE);
ei_printf("\tSample length: %d ms.\n", EI_CLASSIFIER_RAW_SAMPLE_COUNT / 16);
ei_printf("\tNo. of classes: %d\n", sizeof(ei_classifier_inferencing_categories) / sizeof(ei_classifier_inferencing_categories[0]));
ei_printf("\nStarting continuous inference in 2 seconds...\n");
ei_sleep(2000);
if (microphone_inference_start(EI_CLASSIFIER_RAW_SAMPLE_COUNT) == false) {
ei_printf("ERR: Could not allocate audio buffer (size %d), this could be due to the window length of your model\r\n", EI_CLASSIFIER_RAW_SAMPLE_COUNT);
return;
}
ei_printf("Recording...\n");
}
/**
* @brief Arduino main function. Runs the inferencing loop.
*/
void loop() {
bool m = microphone_inference_record();
if (!m) {
ei_printf("ERR: Failed to record audio...\n");
return;
}
signal_t signal;
signal.total_length = EI_CLASSIFIER_RAW_SAMPLE_COUNT;
signal.get_data = µphone_audio_signal_get_data;
ei_impulse_result_t result = { 0 };
EI_IMPULSE_ERROR r = run_classifier(&signal, &result, debug_nn);
if (r != EI_IMPULSE_OK) {
ei_printf("ERR: Failed to run classifier (%d)\n", r);
return;
}
int pred_index = 0; // Initialize pred_index
float pred_value = 0; // Initialize pred_value
// print the predictions
ei_printf("Predictions ");
ei_printf("(DSP: %d ms., Classification: %d ms., Anomaly: %d ms.)",
result.timing.dsp, result.timing.classification, result.timing.anomaly);
ei_printf(": \n");
for (size_t ix = 0; ix < EI_CLASSIFIER_LABEL_COUNT; ix++) {
ei_printf(" %s: ", result.classification[ix].label);
ei_printf_float(result.classification[ix].value);
ei_printf("\n");
if (result.classification[ix].value > pred_value) {
pred_index = ix;
pred_value = result.classification[ix].value;
}
}
// Display inference result
if (pred_index == 3) {
digitalWrite(LED_BUILT_IN, LOW); //Turn on
} else {
digitalWrite(LED_BUILT_IN, HIGH); //Turn off
}
#if EI_CLASSIFIER_HAS_ANOMALY == 1
ei_printf(" anomaly score: ");
ei_printf_float(result.anomaly);
ei_printf("\n");
#endif
}
static void audio_inference_callback(uint32_t n_bytes) {
for (int i = 0; i < n_bytes >> 1; i++) {
inference.buffer[inference.buf_count++] = sampleBuffer[i];
if (inference.buf_count >= inference.n_samples) {
inference.buf_count = 0;
inference.buf_ready = 1;
}
}
}
static void capture_samples(void *arg) {
const int32_t i2s_bytes_to_read = (uint32_t)arg;
size_t bytes_read = i2s_bytes_to_read;
while (record_status) {
/* read data at once from i2s - Modified for XIAO ESP32S3 Sense and I2S.h library */
i2s_read(I2S_IN_PORT, (void*)sampleBuffer, i2s_bytes_to_read, &bytes_read, 100);
// esp_i2s::i2s_read(esp_i2s::I2S_NUM_0, (void *)sampleBuffer, i2s_bytes_to_read, &bytes_read, 100);
if (bytes_read <= 0) {
ei_printf("Error in I2S read : %d", bytes_read);
} else {
if (bytes_read < i2s_bytes_to_read) {
ei_printf("Partial I2S read");
}
// scale the data (otherwise the sound is too quiet)
for (int x = 0; x < i2s_bytes_to_read / 2; x++) {
sampleBuffer[x] = (int16_t)(sampleBuffer[x]) * 8;
}
if (record_status) {
audio_inference_callback(i2s_bytes_to_read);
} else {
break;
}
}
}
vTaskDelete(NULL);
}
/**
* @brief Init inferencing struct and setup/start PDM
*
* @param[in] n_samples The n samples
*
* @return { description_of_the_return_value }
*/
static bool microphone_inference_start(uint32_t n_samples) {
inference.buffer = (int16_t *)malloc(n_samples * sizeof(int16_t));
if (inference.buffer == NULL) {
return false;
}
inference.buf_count = 0;
inference.n_samples = n_samples;
inference.buf_ready = 0;
// if (i2s_init(EI_CLASSIFIER_FREQUENCY)) {
// ei_printf("Failed to start I2S!");
// }
ei_sleep(100);
record_status = true;
xTaskCreate(capture_samples, "CaptureSamples", 1024 * 32, (void *)sample_buffer_size, 10, NULL);
return true;
}
/**
* @brief Wait on new data
*
* @return True when finished
*/
static bool microphone_inference_record(void) {
bool ret = true;
while (inference.buf_ready == 0) {
delay(10);
}
inference.buf_ready = 0;
return ret;
}
/**
* Get raw audio signal data
*/
static int microphone_audio_signal_get_data(size_t offset, size_t length, float *out_ptr) {
numpy::int16_to_float(&inference.buffer[offset], out_ptr, length);
return 0;
}
/**
* @brief Stop PDM and release buffers
*/
static void microphone_inference_end(void) {
  record_status = false;      // stop the capture task
  ei_free(inference.buffer);  // note: sampleBuffer is a static array and must not be free()d
}
#if !defined(EI_CLASSIFIER_SENSOR) || EI_CLASSIFIER_SENSOR != EI_CLASSIFIER_SENSOR_MICROPHONE
#error "Invalid model for current sensor."
#endif
The code for each module is walked through below:
#include <wakeup_detect_houguoxiong_inferencing.h>
// Initialize I2S for audio input
i2s_config_t i2s_config_in = {
.mode = (i2s_mode_t)(I2S_MODE_MASTER | I2S_MODE_RX),
.sample_rate = SAMPLE_RATE,
.bits_per_sample = I2S_BITS_PER_SAMPLE_16BIT, // Note: the INMP441 outputs 32-bit data; 16-bit capture is used here
.channel_format = I2S_CHANNEL_FMT_ONLY_LEFT,
.communication_format = i2s_comm_format_t(I2S_COMM_FORMAT_STAND_I2S),
.intr_alloc_flags = ESP_INTR_FLAG_LEVEL1,
.dma_buf_count = 8,
.dma_buf_len = 1024,
};
i2s_pin_config_t pin_config_in = {
.bck_io_num = I2S_IN_BCLK,
.ws_io_num = I2S_IN_LRC,
.data_out_num = -1,
.data_in_num = I2S_IN_DIN
};
i2s_driver_install(I2S_IN_PORT, &i2s_config_in, 0, NULL);
i2s_set_pin(I2S_IN_PORT, &pin_config_in);
// summary of inferencing settings (from model_metadata.h)
ei_printf("Inferencing settings:\n");
ei_printf("\tInterval: ");
ei_printf_float((float)EI_CLASSIFIER_INTERVAL_MS);
ei_printf(" ms.\n");
ei_printf("\tFrame size: %d\n", EI_CLASSIFIER_DSP_INPUT_FRAME_SIZE);
ei_printf("\tSample length: %d ms.\n", EI_CLASSIFIER_RAW_SAMPLE_COUNT / 16);
ei_printf("\tNo. of classes: %d\n", sizeof(ei_classifier_inferencing_categories) / sizeof(ei_classifier_inferencing_categories[0]));
ei_printf("\nStarting continuous inference in 2 seconds...\n");
ei_sleep(2000);
if (microphone_inference_start(EI_CLASSIFIER_RAW_SAMPLE_COUNT) == false) {
ei_printf("ERR: Could not allocate audio buffer (size %d), this could be due to the window length of your model\r\n", EI_CLASSIFIER_RAW_SAMPLE_COUNT);
return;
}
static bool microphone_inference_start(uint32_t n_samples) {
inference.buffer = (int16_t *)malloc(n_samples * sizeof(int16_t));
if (inference.buffer == NULL) {
return false;
}
inference.buf_count = 0;
inference.n_samples = n_samples;
inference.buf_ready = 0;
ei_sleep(100);
record_status = true;
xTaskCreate(capture_samples, "CaptureSamples", 1024 * 32, (void *)sample_buffer_size, 10, NULL);
return true;
}
static void capture_samples(void *arg) {
const int32_t i2s_bytes_to_read = (uint32_t)arg;
size_t bytes_read = i2s_bytes_to_read;
while (record_status) {
/* read data at once from i2s - Modified for XIAO ESP32S3 Sense and I2S.h library */
i2s_read(I2S_IN_PORT, (void*)sampleBuffer, i2s_bytes_to_read, &bytes_read, 100);
// esp_i2s::i2s_read(esp_i2s::I2S_NUM_0, (void *)sampleBuffer, i2s_bytes_to_read, &bytes_read, 100);
if (bytes_read <= 0) {
ei_printf("Error in I2S read : %d", bytes_read);
} else {
if (bytes_read < i2s_bytes_to_read) {
ei_printf("Partial I2S read");
}
// scale the data (otherwise the sound is too quiet)
for (int x = 0; x < i2s_bytes_to_read / 2; x++) {
sampleBuffer[x] = (int16_t)(sampleBuffer[x]) * 8;
}
if (record_status) {
audio_inference_callback(i2s_bytes_to_read);
} else {
break;
}
}
}
vTaskDelete(NULL);
}
static void audio_inference_callback(uint32_t n_bytes) {
for (int i = 0; i < n_bytes >> 1; i++) {
inference.buffer[inference.buf_count++] = sampleBuffer[i];
if (inference.buf_count >= inference.n_samples) {
inference.buf_count = 0;
inference.buf_ready = 1;
}
}
}
void loop() {
bool m = microphone_inference_record();
if (!m) {
ei_printf("ERR: Failed to record audio...\n");
return;
}
signal_t signal;
signal.total_length = EI_CLASSIFIER_RAW_SAMPLE_COUNT;
signal.get_data = µphone_audio_signal_get_data;
ei_impulse_result_t result = { 0 };
EI_IMPULSE_ERROR r = run_classifier(&signal, &result, debug_nn);
if (r != EI_IMPULSE_OK) {
ei_printf("ERR: Failed to run classifier (%d)\n", r);
return;
}
int pred_index = 0; // Initialize pred_index
float pred_value = 0; // Initialize pred_value
// print the predictions
ei_printf("Predictions ");
ei_printf("(DSP: %d ms., Classification: %d ms., Anomaly: %d ms.)",
result.timing.dsp, result.timing.classification, result.timing.anomaly);
ei_printf(": \n");
for (size_t ix = 0; ix < EI_CLASSIFIER_LABEL_COUNT; ix++) {
ei_printf(" %s: ", result.classification[ix].label);
ei_printf_float(result.classification[ix].value);
ei_printf("\n");
if (result.classification[ix].value > pred_value) {
pred_index = ix;
pred_value = result.classification[ix].value;
}
}
// Display inference result
if (pred_index == 3) {
digitalWrite(LED_BUILT_IN, LOW); //Turn on
} else {
digitalWrite(LED_BUILT_IN, HIGH); //Turn off
}
#if EI_CLASSIFIER_HAS_ANOMALY == 1
ei_printf(" anomaly score: ");
ei_printf_float(result.anomaly);
ei_printf("\n");
#endif
}
Accessing Baidu's speech recognition, speech synthesis, and ERNIE Bot (Wenxin Yiyan) LLM APIs all requires an access_token. On the ESP32-S3, we build an HTTP request in the format of the access_token API, send it, wait for the response, and parse the access_token out of the response data.
Full code:
// Get Baidu API access token
String getAccessToken(const char* api_key, const char* secret_key) {
String access_token = "";
HTTPClient http;
// Build the HTTP request
http.begin("https://aip.baidubce.com/oauth/2.0/token?grant_type=client_credentials&client_id=" + String(api_key) + "&client_secret=" + String(secret_key));
int httpCode = http.POST("");
if (httpCode == HTTP_CODE_OK) {
String response = http.getString();
DynamicJsonDocument doc(1024);
deserializeJson(doc, response);
access_token = doc["access_token"].as<String>();
Serial.printf("[HTTP] GET access_token: %s\n", access_token.c_str());
} else {
Serial.printf("[HTTP] GET... failed, error: %s\n", http.errorToString(httpCode).c_str());
}
http.end();
return access_token;
}
Here you first need to apply for the api_key and secret_key on the Baidu AI Cloud site, following the steps in Chapter 5. They are passed in as parameters, an http.POST request is sent in the API's required format, and the access_token is parsed out of the response data.
After the ESP32-S3 captures the INMP441 audio over I2S, the audio stream must be converted to text, which calls for real-time speech recognition; here we use Baidu's speech recognition API.
Main code:
String baiduSTT_Send(String access_token, uint8_t* audioData, int audioDataSize) {
String recognizedText = "";
if (access_token == "") {
Serial.println("access_token is null");
return recognizedText;
}
// The audio data needs to be Base64-encoded, which grows it by about 1/3
int audio_data_len = audioDataSize * sizeof(char) * 1.4;
unsigned char* audioDataBase64 = (unsigned char*)ps_malloc(audio_data_len);
if (!audioDataBase64) {
Serial.println("Failed to allocate memory for audioDataBase64");
return recognizedText;
}
// Size of the JSON packet; Base64-encoding the audioData grows it by about 1/3
int data_json_len = audioDataSize * sizeof(char) * 1.4;
char* data_json = (char*)ps_malloc(data_json_len);
if (!data_json) {
Serial.println("Failed to allocate memory for data_json");
return recognizedText;
}
// Base64 encode audio data
encode_base64(audioData, audioDataSize, audioDataBase64);
memset(data_json, '\0', data_json_len);
strcat(data_json, "{");
strcat(data_json, "\"format\":\"pcm\",");
strcat(data_json, "\"rate\":16000,");
strcat(data_json, "\"dev_pid\":1537,");
strcat(data_json, "\"channel\":1,");
strcat(data_json, "\"cuid\":\"57722200\",");
strcat(data_json, "\"token\":\"");
strcat(data_json, access_token.c_str());
strcat(data_json, "\",");
sprintf(data_json + strlen(data_json), "\"len\":%d,", audioDataSize);
strcat(data_json, "\"speech\":\"");
strcat(data_json, (const char*)audioDataBase64);
strcat(data_json, "\"");
strcat(data_json, "}");
// Build the HTTP request
HTTPClient http_client;
http_client.begin("http://vop.baidu.com/server_api");
http_client.addHeader("Content-Type", "application/json");
int httpCode = http_client.POST(data_json);
if (httpCode > 0) {
if (httpCode == HTTP_CODE_OK) {
// Get the response body
String response = http_client.getString();
Serial.println(response);
// Parse the result field out of the JSON
DynamicJsonDocument responseDoc(2048);
deserializeJson(responseDoc, response);
recognizedText = responseDoc["result"].as<String>();
}
} else {
Serial.printf("[HTTP] POST failed, error: %s\n", http_client.errorToString(httpCode).c_str());
}
// Free the buffers
if (audioDataBase64) {
free(audioDataBase64);
}
if (data_json) {
free(data_json);
}
http_client.end();
return recognizedText;
}
Key parts of the code above are explained below:
// The audio data needs to be Base64-encoded, which grows it by about 1/3
int audio_data_len = audioDataSize * sizeof(char) * 1.4;
unsigned char* audioDataBase64 = (unsigned char*)ps_malloc(audio_data_len);
if (!audioDataBase64) {
Serial.println("Failed to allocate memory for audioDataBase64");
return recognizedText;
}
// Size of the JSON packet; Base64-encoding the audioData grows it by about 1/3
int data_json_len = audioDataSize * sizeof(char) * 1.4;
char* data_json = (char*)ps_malloc(data_json_len);
if (!data_json) {
Serial.println("Failed to allocate memory for data_json");
return recognizedText;
}
// Base64 encode audio data
encode_base64(audioData, audioDataSize, audioDataBase64);
memset(data_json, '\0', data_json_len);
strcat(data_json, "{");
strcat(data_json, "\"format\":\"pcm\",");
strcat(data_json, "\"rate\":16000,");
strcat(data_json, "\"dev_pid\":1537,");
strcat(data_json, "\"channel\":1,");
strcat(data_json, "\"cuid\":\"57722200\",");
strcat(data_json, "\"token\":\"");
strcat(data_json, access_token.c_str());
strcat(data_json, "\",");
sprintf(data_json + strlen(data_json), "\"len\":%d,", audioDataSize);
strcat(data_json, "\"speech\":\"");
strcat(data_json, (const char*)audioDataBase64);
strcat(data_json, "\"");
strcat(data_json, "}");
// Parse the result field out of the JSON
DynamicJsonDocument responseDoc(2048);
deserializeJson(responseDoc, response);
recognizedText = responseDoc["result"].as<String>();
Speech recognition returns its result as text, which we then use as the input to the Baidu LLM Agent API. The LLM API call is implemented as follows:
// Get Baidu API conversation id
String getConversation_id(const char* api_key, const char* app_id) {
String conversation_id = "";
// Build the HTTP request
HTTPClient http;
http.begin("https://qianfan.baidubce.com/v2/app/conversation");
http.addHeader("Content-Type", "application/json");
http.addHeader("X-Appbuilder-Authorization", "Bearer " + String(api_key));
// Create a JSON document
DynamicJsonDocument requestJson(1024);
requestJson["app_id"] = app_id;
// Serialize the JSON to a string
String requestBody;
serializeJson(requestJson, requestBody);
// Send the HTTP request
int httpCode = http.POST(requestBody);
if (httpCode == HTTP_CODE_OK) {
String response = http.getString();
DynamicJsonDocument doc(1024);
deserializeJson(doc, response);
conversation_id = doc["conversation_id"].as<String>();
ei_printf("[HTTP] GET conversation_id: %s\n", conversation_id.c_str());
} else {
ei_printf("[HTTP] GET... failed, error: %s\n", http.errorToString(httpCode).c_str());
}
http.end();
return conversation_id;
}
The text returned by the Baidu ERNIE Bot API needs to be played through the speaker, so the text must be converted to audio output. We do this by calling the Baidu speech synthesis API. Main code:
void baiduTTS_Send(String access_token, String text) {
if (access_token == "") {
Serial.println("access_token is null");
return;
}
if (text.length() == 0) {
Serial.println("text is null");
return;
}
const int per = 1;
const int spd = 5;
const int pit = 5;
const int vol = 10;
const int aue = 6;
// URL-encode the text (twice)
String encodedText = urlEncode(urlEncode(text));
// Assemble the request URL
String url = "https://tsn.baidu.com/text2audio";
const char* header[] = { "Content-Type", "Content-Length" };
url += "?tok=" + access_token;
url += "&tex=" + encodedText;
url += "&per=" + String(per);
url += "&spd=" + String(spd);
url += "&pit=" + String(pit);
url += "&vol=" + String(vol);
url += "&aue=" + String(aue);
url += "&cuid=esp32s3";
url += "&lan=zh";
url += "&ctp=1";
// Create the HTTP request
HTTPClient http;
http.begin(url);
http.collectHeaders(header, 2);
// Send the request
int httpResponseCode = http.GET();
if (httpResponseCode > 0) {
if (httpResponseCode == HTTP_CODE_OK) {
String contentType = http.header("Content-Type");
Serial.println(contentType);
if (contentType.startsWith("audio")) {
Serial.println("TTS synthesis succeeded");
// Get the returned audio stream
Stream* stream = http.getStreamPtr();
uint8_t buffer[512];
size_t bytesRead = 0;
// Set a 200 ms stream timeout to avoid trailing noise
stream->setTimeout(200);
while (http.connected() && (bytesRead = stream->readBytes(buffer, sizeof(buffer))) > 0) {
// Play the audio
playAudio(buffer, bytesRead);
delay(1);
}
// Flush the I2S DMA buffer
clearAudio();
} else if (contentType.equals("application/json")) {
Serial.println("TTS synthesis returned an error");
} else {
Serial.println("Unknown Content-Type");
}
} else {
Serial.println("Failed to receive audio file");
}
} else {
Serial.print("Error code: ");
Serial.println(httpResponseCode);
}
http.end();
}
// Play audio data using MAX98357A
void playAudio(uint8_t* audioData, size_t audioDataSize) {
if (audioDataSize > 0) {
// Write the samples out over I2S
size_t bytes_written = 0;
i2s_write(I2S_OUT_PORT, (int16_t*)audioData, audioDataSize, &bytes_written, portMAX_DELAY);
}
}
void clearAudio(void) {
// Flush the I2S DMA buffer
i2s_zero_dma_buffer(I2S_OUT_PORT);
Serial.print("clearAudio");
}
Key parts of the code above are explained below:
// URL-encode the text (twice)
String encodedText = urlEncode(urlEncode(text));
// Assemble the request URL
String url = "https://tsn.baidu.com/text2audio";
const char* header[] = { "Content-Type", "Content-Length" };
url += "?tok=" + access_token;
url += "&tex=" + encodedText;
url += "&per=" + String(per);
url += "&spd=" + String(spd);
url += "&pit=" + String(pit);
url += "&vol=" + String(vol);
url += "&aue=" + String(aue);
url += "&cuid=esp32s3";
url += "&lan=zh";
url += "&ctp=1";
// Create the HTTP request
HTTPClient http;
http.begin(url);
http.collectHeaders(header, 2);
// Set a 200 ms stream timeout to avoid trailing noise
stream->setTimeout(200);
while (http.connected() && (bytesRead = stream->readBytes(buffer, sizeof(buffer))) > 0) {
// Play the audio
playAudio(buffer, bytesRead);
delay(1);
}
void clearAudio(void) {
// Flush the I2S DMA buffer
i2s_zero_dma_buffer(I2S_OUT_PORT);
Serial.print("clearAudio");
}
The following hardware is needed:
We flash an audio-recording sketch to the ESP32-S3 to capture audio; the recordings are saved to the microSD card and can then be read out on a PC.
The recording project is under **esp32s3-ai-chat/example/capture_audio_data**. Open it; since it uses PSRAM, you must **enable the PSRAM option** before compiling, as shown below, then compile and flash it to the ESP32-S3.
Record samples for at least **three labels (wake word, noise, and unknown), with at least 10 recordings per label**; the more data, the better the trained model recognizes the wake word.
Log in to Edge Impulse (https://edgeimpulse.com/), register an account, and create a project.
After creating the project, use the Upload Existing Data tool in the Data Acquisition section and select the files to upload.
Once uploaded, the data is automatically split into Training and Test sets (80%/20%).
Training requires every sample to be exactly 1 second long, but the recordings uploaded in the previous section are 10 seconds each, so each recording must be split into 1-second samples, as shown below:
Create an impulse to set up the data preprocessing and select the model, as shown below:
Steps:
The next step is to generate the feature images that will be trained on in the next stage.
Next, design the model structure and start training:
Once the model is trained, check its accuracy against the test set.
After training completes, generate the library that runs on the Arduino ESP32 platform.
With the wake-word library trained, we need to test the wake-word function on the ESP32-S3. Open the
**esp32s3-ai-chat/example/wake_detect** project and run the wake test.
To replace the model with a newly trained one in the Arduino project, delete the previously imported model library, remove the old header include from the sketch, and re-import the new library as described in section 7.3.2.
To delete a library, remove its folder from the libraries directory, then close and reopen the Arduino IDE.