# keyword-spotting_2
**Repository Path**: mingyuzz/keyword-spotting_2
## Basic Information
- **Project Name**: keyword-spotting_2
- **Description**: 1111111111111111111111111111111
- **Primary Language**: Unknown
- **License**: Not specified
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No
## Statistics
- **Stars**: 1
- **Forks**: 1
- **Created**: 2022-10-25
- **Last Updated**: 2024-05-11
## Categories & Tags
**Categories**: Uncategorized
**Tags**: None
## README
# Real-Time Keyword Spotting
## Introduction
In this repository, I implement in Python/Keras a system for detecting specific spoken words in speech signals, a task known as **keyword spotting** (KWS). When reading a speech signal, I detect not only the presence of a keyword but also its position in time. For this purpose, I use a CNN-RNN network with a CTC (Connectionist Temporal Classification) loss function [1]. As the system is causal (I use only unidirectional LSTMs, and the convolutional operations are causal with some delay), I also provide a real-time implementation.
Traditionally, keyword spotting and other speech detection tasks have been implemented with hidden Markov models (HMMs). Since [2,3,4], much more attention has been paid to solving this problem with deep convolutional or recurrent networks, which achieve higher accuracy with a smaller footprint. Some of these approaches [2,3], however, depend on annotating every frame of the input signal as one of the keywords or as a non-keyword frame. In these works, the annotation comes from pre-training an HMM system, from which the frame-wise labeling is obtained; the deep neural network can then be trained and run with little computational burden.
In this work, I use a CNN-RNN network with a CTC loss function [1], as in [5]. This approach removes the need for a label on each input frame, since the CTC forward-backward algorithm aligns the input sequence with the label sequence so as to minimize the loss function.
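As a rough illustration of how a CTC loss can be attached to a Keras model (a minimal sketch only; the layer and tensor names here are mine, not the ones used in this repository):

```python
import tensorflow as tf
from tensorflow import keras

# Hedged sketch: wrap tf.keras.backend.ctc_batch_cost in a custom layer.
# y_pred holds per-frame token probabilities (batch, time, num_tokens);
# labels are the padded label sequences; the two length tensors give the
# true input and label lengths of each sample.
class CTCLossLayer(keras.layers.Layer):
    def call(self, inputs):
        labels, y_pred, input_length, label_length = inputs
        loss = keras.backend.ctc_batch_cost(labels, y_pred,
                                            input_length, label_length)
        self.add_loss(tf.reduce_mean(loss))
        return y_pred  # pass the predictions through unchanged
```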
The input speech signal is preprocessed to extract 13 mel-frequency cepstral coefficients (MFCCs) for each frame. MFCCs are widely used in speech recognition tasks because they capture the distinctions between phonemes well and are more compact than the raw spectrogram.
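For reference, a minimal sketch of extracting 13 MFCCs per frame with librosa (librosa and the window parameters are assumptions here; the repository's own feature extraction may be implemented differently, e.g. in TensorFlow):

```python
import librosa

# Hedged sketch: 13 MFCCs per frame, assuming 16 kHz audio with 25 ms
# windows and a 10 ms hop (these exact values are assumptions).
signal, sr = librosa.load("sample.wav", sr=16000)
mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=13,
                            n_fft=400, hop_length=160)
print(mfcc.shape)  # (13, number_of_frames)
```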
For this test, I trained the model to identify the following eight keywords (arbitrarily chosen from publicly available datasets): `house`, `right`, `down`, `left`, `no`, `five`, `one` and `three`. To train the network, I use a subset of the Google Speech Commands dataset [6], which contains isolated speech data for these keywords, and to train the detection of the keywords within entire spoken sentences, I use the LibriSpeech dataset [7].
As keyword occurrences in LibriSpeech are very rare, the network is pushed toward predicting non-keywords almost all the time, which is undesirable. To remedy this, I restrict the number of LibriSpeech input samples that do not contain any keyword and create a third dataset: I concatenate multiple speech signals from Speech Commands, keywords or not, to form sequences with multiple keyword occurrences in the same signal sample. This additional modification of the dataset greatly improved the model's prediction performance.
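A rough sketch of how such concatenated samples can be built (the file paths and the use of `soundfile` are illustrative, not the actual contents of `create_dataset.py`):

```python
import numpy as np
import soundfile as sf

# Hedged sketch: concatenate several Speech Commands clips (keywords or not)
# into one longer sample with its word-level transcription.
clips = ["house/001.wav", "bed/004.wav", "three/010.wav"]  # illustrative paths
pieces, words = [], []
for path in clips:
    audio, sr = sf.read(path)
    pieces.append(audio)
    words.append(path.split("/")[0])  # folder name gives the spoken word
sf.write("concatenated_sample.wav", np.concatenate(pieces), sr)
print(" ".join(words))  # e.g. "house bed three"
```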
## Usage Instructions
### (1) Install Required Python Packages
Install the required packages by running the following:
`pip install -r requirements.txt`
If you are on Windows, you might have trouble installing PyAudio (required only for the real-time implementation). If so, you must download the appropriate `.whl` file directly from the PyAudio repository, as described in
https://stackoverflow.com/questions/52283840/i-cant-install-pyaudio-on-windows-how-to-solve-error-microsoft-visual-c-14
### (2) Download Datasets
First, download the datasets you will use to train your model. Any dataset can be used, although it is easiest to use Google Speech Commands [6] and LibriSpeech [7] (I used the `train-clean-100.tar.gz` subset), because I have already implemented functions to read these specific datasets in `functions_datasets.py`.
Optionally, you can also download a dataset of additional background noise. I used MS-SNSD [8] (`noise-train`) in my tests. (Intuitively, this additional noise should improve the robustness of the detection. However, in my tests it did not improve the evaluation metrics, and it actually degraded the online performance I tested with my microphone.)
### (3) Convert LibriSpeech Files to WAV
LibriSpeech files are originally in FLAC format. As I use the TensorFlow audio decoder, the files must be in WAV format. Just move `flac2wav.sh` to the dataset folder and run it (it uses the avconv program from libav, http://builds.libav.org/windows/nightly-gpl/, in case you don't already have it). If you are on a Unix system, just run
`./flac2wav.sh`
or, if you are on Windows,
`powershell ./flac2wav.sh`
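If avconv is not available, a simple Python alternative (an assumption on my part, not part of this repository's scripts) is to convert the files with the `soundfile` package, which reads FLAC through libsndfile:

```python
import pathlib
import soundfile as sf

# Hedged sketch: convert every .flac file under the LibriSpeech folder to .wav,
# writing the WAV file next to the original. The folder name is illustrative.
for flac_path in pathlib.Path("LibriSpeech").rglob("*.flac"):
    audio, sr = sf.read(str(flac_path))
    sf.write(str(flac_path.with_suffix(".wav")), audio, sr)
```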
### (4) Prepare Datasets
To improve performance, I generate an additional dataset out of Google Speech Commands with multiple keywords per audio sample. You can generate it by running `create_dataset.py`. In the first lines of the code, you can edit some settings, such as the folder paths, the keywords you want to use, the number of samples per keyword, etc.
When training or evaluating the model, the scripts read a file with the path and transcription of previously selected audio files. You generate this file by running `prepare_datasets.py`, which also generates a file containing some information about the datasets, such as the number of samples per keyword. You can edit some settings, such as dataset paths, selected keywords and sample restrictions, in the first lines of the code.
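Conceptually, the file read by the training and evaluation scripts is just a list of audio paths paired with transcriptions; a minimal sketch of that idea (the CSV format, column names and file paths are assumptions, not the repository's actual layout):

```python
import csv

# Hypothetical sketch: one row per selected audio file, pairing its path
# with its transcription.
selected = [
    ("speech_commands/house/001.wav", "house"),
    ("librispeech/103-1240-0000.wav", "certainly not to walk three miles"),
]
with open("dataset_index.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["path", "transcription"])
    writer.writerows(selected)
```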
### (5) Train Model
To train the model, just run `kws_train.py`. Training parameters are passed as command-line arguments, for example
`python kws_train.py -d data -r result_01 -m 2 --train_test_split 0.7 --epochs 50`
where `data` is the folder generated by `prepare_datasets.py`, `result_01` is the folder where the result files (model weights and history) are stored, 2 is the chosen network model, a fraction of 0.7 of the dataset is taken as the training set, and the training runs for 50 epochs. The other parameters can be found at the beginning of `kws_train.py`.
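The flags above map to standard argparse options; a hedged sketch of how such a command-line interface typically looks (the exact option names and defaults in `kws_train.py` may differ):

```python
import argparse

# Sketch of the command-line interface implied by the example call above.
parser = argparse.ArgumentParser()
parser.add_argument("-d", "--data", help="folder produced by prepare_datasets.py")
parser.add_argument("-r", "--result", help="folder for model weights and history")
parser.add_argument("-m", "--model", type=int, default=2, help="network model number")
parser.add_argument("--train_test_split", type=float, default=0.7)
parser.add_argument("--epochs", type=int, default=50)
args = parser.parse_args()
```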
Currently, there are only a few non-parametric built-in network models, defined in `models.py`, all of them with convolutional and recurrent layers. Some of them use Conv1D layers, intended for the MFCC features, and others use Conv2D layers, intended for the mel-spectrogram features.
### (6) Evaluation and Prediction
To evaluate the model, I count the number of correct detections (true positives), false positives and false negatives in each test sample, for each keyword, and compute precision, recall and F1-score. The global metrics for the model are then computed as the average of the per-keyword metrics.
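In terms of these counts, the per-keyword metrics follow the standard definitions; a small sketch (the function name is illustrative):

```python
# Hedged sketch: precision, recall and F1-score for one keyword from its
# true positive (tp), false positive (fp) and false negative (fn) counts.
def keyword_metrics(tp, fp, fn):
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# The model's global metrics are the averages of these values over all keywords.
```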
To run the evaluation, execute `kws_evaluate.py`, for example:
`python kws_evaluate.py -d data -r result_01`
It uses only the portion of the data that was held out for validation during the training of the referenced result folder. It also writes a file `performance.txt` in the result folder recording the performance results.
To visualize the network outputs, you can run `kws_predict.py`, which plots, for some selected samples of the dataset, the audio signal and the network output token probabilities with Matplotlib, along with the label token sequence and the predicted sequence.
### (7) Real-time Implementation
I also made a real-time implementation of the network, using PyAudio to capture the audio stream from a microphone device. It can be run, for example, as
`python kws_realtime.py -r result_01`
which uses the model parameters stored in the result folder. Currently, it only supports model number 2, which appears to give the best results.
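For intuition, a minimal sketch of the audio-capture loop with PyAudio (the buffer size, sample rate and the `process_frame` hook are assumptions, not the actual code in `kws_realtime.py`):

```python
import numpy as np
import pyaudio

# Hedged sketch: read 16 kHz mono audio from the default microphone in
# 100 ms chunks and hand each chunk to a (hypothetical) inference step.
RATE, CHUNK = 16000, 1600  # assumed values

def process_frame(frame):
    pass  # placeholder for feature extraction + model prediction

p = pyaudio.PyAudio()
stream = p.open(format=pyaudio.paInt16, channels=1, rate=RATE,
                input=True, frames_per_buffer=CHUNK)
try:
    while True:
        data = stream.read(CHUNK)
        frame = np.frombuffer(data, dtype=np.int16).astype(np.float32) / 32768.0
        process_frame(frame)
finally:
    stream.stop_stream()
    stream.close()
    p.terminate()
```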
## About the model
I have tested several different network models, but they all share the same concept of a few convolutional layers followed by recurrent ones. In the following, I describe model number 2, which has given the best results in my tests (a sketch of this stack is given after the list).
- MFCC feature extraction;
- Batch normalization;
- Convolutional block below, with N = 32 filters;
- Convolutional block below, with N = 64 filters;
- Convolutional block below, with N = 128 filters;
- Unidirectional LSTM layer with 128 units, with dropout rate 0.25;
- Unidirectional LSTM layer with 128 units, with dropout rate 0.25;
- Dense layer with 10 units (8 keyword tokens, 1 non-keyword token and CTC null token) with softmax activation.
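Below is a hedged Keras sketch of this stack. The internal structure of each convolutional block is not reproduced in this README, so it is approximated here as a single causal Conv1D layer followed by batch normalization and ReLU; kernel sizes and other details are assumptions.

```python
from tensorflow import keras
from tensorflow.keras import layers

NUM_TOKENS = 10  # 8 keywords + 1 non-keyword token + 1 CTC null token

def conv_block(x, filters):
    # Assumed block structure: causal Conv1D + batch norm + ReLU.
    x = layers.Conv1D(filters, kernel_size=5, padding="causal")(x)
    x = layers.BatchNormalization()(x)
    return layers.ReLU()(x)

inputs = keras.Input(shape=(None, 13))          # (time, 13 MFCCs)
x = layers.BatchNormalization()(inputs)
for filters in (32, 64, 128):
    x = conv_block(x, filters)
x = layers.LSTM(128, return_sequences=True, dropout=0.25)(x)
x = layers.LSTM(128, return_sequences=True, dropout=0.25)(x)
outputs = layers.Dense(NUM_TOKENS, activation="softmax")(x)
model = keras.Model(inputs, outputs)
```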
## Sample results
The figure below illustrates a speech signal and its features (mel-spectrogram and MFCCs). It is an example of the spoken word "left".
The figure below shows the learning curves for the models in the `example` folder. Note that model 2 performs best. All models are trained without additive noise augmentation, except `Model 2 noise`, for which background noise was added to 50% of the training and validation samples.
The table below shows the performance of each model in terms of precision, recall and F1-score. These metrics were computed by counting the total number of true positive, false positive and false negative detections for each keyword and then taking the mean over keywords.
The next figure shows the network outputs (the probability of each token) for a sample signal from LibriSpeech containing the sentence
`certainly not to walk three miles or four miles or five miles.`
The last token (dashed black line) is the null character token, inherent to the CTC algorithm, and encodes no actual character. The predicted tokens are post-processed to remove consecutive duplicates, which is necessary when using CTC. Note that the model correctly detected both keywords and missed one of the fillers; this does not count as an error here, since we are only interested in the keywords.
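This duplicate-removal step is the standard greedy CTC post-processing; a small sketch (the blank index follows the 10-token layout described above):

```python
import numpy as np

# Hedged sketch of greedy CTC decoding: pick the most likely token per frame,
# collapse consecutive repeats, then drop the null (blank) token.
BLANK = 9  # the last of the 10 tokens is the CTC null token

def ctc_greedy_decode(probs):
    """probs: array of shape (time, num_tokens) with per-frame probabilities."""
    best = np.argmax(probs, axis=-1)
    collapsed = [t for i, t in enumerate(best) if i == 0 or t != best[i - 1]]
    return [t for t in collapsed if t != BLANK]
```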
The next figure illustrates the real-time application. It captures the instant when I said "no one left the house". Note that it did not detect the keyword "one". In this real-time application, the network does not seem to perform as well as the metrics above suggest.
## References
[1] A. Graves, S. Fernandez, F. Gomez and J. Schmidhuber. "Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks". Proceedings of the 23rd International Conference on Machine Learning (ICML'06), pp. 369-376, 2006.
[2] G. Chen, C. Parada, and G. Heigold. "Small-footprint keyword spotting using deep neural networks". Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4087–4091, 2014.
[3] T.N. Sainath and C. Parada. "Convolutional neural networks for small-footprint keyword spotting". Proceedings of INTERSPEECH, pp. 1478–1482, 2015.
[4] S.O. Arik, M. Kliegl, R. Child, J. Hestness, A. Gibiansky, C. Fougner, R. Prenger and A. Coates. "Convolutional recurrent neural networks for small-footprint keyword spotting". Proceedings of INTERSPEECH, pp. 1606–1610, 2017.
[5] S. Fernandez, A. Graves, and J. Schmidhuber. "An application of recurrent neural networks to discriminative keyword spotting". Proceedings of the International Conference on Artificial Neural Networks (ICANN), pp. 220–229, 2007.
[6] P. Warden. "Speech commands: a dataset for limited-vocabulary speech recognition". 2018. http://download.tensorflow.org/data/speech_commands_v0.02.tar.gz
[7] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur. "Librispeech: an ASR corpus based on public domain audio books". Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206-5210, 2015. https://www.openslr.org/12
[8] Microsoft Scalable Noisy Speech Dataset. https://github.com/microsoft/MS-SNSD