Building Speech Dataset for LSTM binary classification

Problem Description

I'm trying to do binary LSTM classification using Theano. I have gone through the example code, but I want to build my own.

I have a small set of "Hello" and "Goodbye" recordings that I am using. I preprocess these by extracting their MFCC features and saving the features in text files. I have 20 speech files (10 of each word), and I generate one text file per recording, so 20 text files containing the MFCC features. Each file is a 13x56 matrix.
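Roughly, the extraction step looks like this (a sketch; librosa and the file names here are just stand-ins for whatever MFCC extractor and naming are actually used):

import librosa
import numpy as np

# Sketch of the MFCC extraction described above; librosa and the file names
# are placeholders, not necessarily the tool actually used.
signal, sr = librosa.load("hello_1.wav", sr=None)
mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=13)   # shape (13, n_frames)
np.savetxt("hello_1.txt", mfcc)                           # one 13x56 matrix per recording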

My problem now is: how do I use these text files to train the LSTM?

I am relatively new to this. I have gone through some literature on it as well, but have not really gained a good understanding of the concept.

Any simpler way of using LSTMs would also be welcome.

Solution

There are many existing implementations, for example a TensorFlow implementation and a Kaldi-focused implementation with all the scripts; it is better to check those first.

Theano is too low-level; you might try Keras instead, as described in the tutorial. You can run the tutorial as is to understand how things go.

Then you need to prepare a dataset. You need to turn your data into sequences of data frames, and for every data frame in a sequence you need to assign an output label.
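For the 13x56 text files described in the question, that preparation could look like the following sketch (file names and the 0/1 label encoding are assumptions):

import numpy as np

# Load the 20 MFCC text files into one array of frame sequences.
files = ["hello_%d.txt" % i for i in range(10)] + ["goodbye_%d.txt" % i for i in range(10)]
labels = [0] * 10 + [1] * 10          # 0 = "hello", 1 = "goodbye" (assumed encoding)

X = []
for path in files:
    mfcc = np.loadtxt(path)           # (13, 56) matrix, as described in the question
    X.append(mfcc.T)                  # transpose to (56, 13): one 13-float frame per time step

X = np.array(X)                       # (20, 56, 13), since every file has the same 56 frames
y = np.array(labels)                  # (20,)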

Keras supports two types of RNN layers - layers returning sequences and layers returning simple values. You can experiment with both; in code you just use return_sequences=True or return_sequences=False.
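In Keras code the difference is only this flag (the layer size 32 is an arbitrary choice for illustration):

from tensorflow.keras.layers import LSTM   # older Keras: from keras.layers import LSTM

per_frame = LSTM(32, return_sequences=True)       # outputs one vector per input frame
per_utterance = LSTM(32, return_sequences=False)  # outputs a single vector per sequence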

To train with sequences, you can assign a dummy label to all frames except the last one, where you assign the label of the word you want to recognize. You need to place the input and output labels into arrays. So it will be:

X = [[word1frame1, word1frame2, ..., word1framen],[word2frame1, word2frame2,...word2framen]]

Y = [[0,0,...,1], [0,0,...,2]]

In X every element is a vector of 13 floats. In Y every element is just a number - 0 for intermediate frames and the word ID for the final frame.
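Continuing the loading sketch above, the per-frame labels for this scheme could be built like this (the dummy class 0 and the word IDs 1 and 2 follow the Y example; y is the 0/1 array from the earlier sketch):

import numpy as np

# Per-frame labels for return_sequences=True training (assumed 56-frame sequences).
n_frames = 56
Y_frames = np.zeros((len(y), n_frames), dtype=int)   # dummy label 0 for every frame...
Y_frames[:, -1] = y + 1                              # ...word ID (1 or 2) on the final frame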

To train with just labels, you again place the input and output labels into arrays, but the output array is simpler. So the data will be:

X = [[word1frame1, word1frame2, ..., word1framen],[word2frame1, word2frame2,...word2framen]]

Y = [[0,0,1], [0,1,0]]

Note that the output is one-hot encoded (np_utils.to_categorical) to turn the labels into vectors instead of plain numbers.
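For example (to_categorical lives in keras.utils.np_utils in older Keras and in tensorflow.keras.utils in current versions; the arrays are the ones from the sketches above):

from tensorflow.keras.utils import to_categorical

Y_cat = to_categorical(y, num_classes=2)                 # per-word labels -> shape (20, 2)
Y_frames_cat = to_categorical(Y_frames, num_classes=3)   # per-frame labels -> shape (20, 56, 3)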

Then you create the network architecture. You can have 13 floats for input and a vector for output. In the middle you might have one fully connected layer followed by one LSTM layer. Do not use layers that are too big; start with small ones.
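A minimal sketch of such a network in Keras, for the per-word labelling case (layer sizes are illustrative guesses, not recommendations):

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, LSTM, TimeDistributed

# Small network: a dense layer applied to each 13-float frame, one LSTM layer,
# and a softmax over the two words.
model = Sequential([
    TimeDistributed(Dense(32, activation="relu"), input_shape=(56, 13)),
    LSTM(32),                            # return_sequences=False: one vector per utterance
    Dense(2, activation="softmax"),
])
model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])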

Then you feed this dataset into model.fit and it trains the model. You can estimate the model's quality on a held-out set after training.
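A sketch of that last step, continuing from the arrays above (the split, epoch count, and batch size are arbitrary; with 20 files this is only a sanity check, not a reliable estimate):

import numpy as np

# Shuffle, hold out 4 utterances, train, and evaluate.
idx = np.random.permutation(len(X))
train_idx, test_idx = idx[:16], idx[16:]

model.fit(X[train_idx], Y_cat[train_idx], epochs=50, batch_size=4)
loss, acc = model.evaluate(X[test_idx], Y_cat[test_idx])
print("held-out accuracy:", acc)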

You will have a problem with convergence since you have just 20 examples. You need far more examples, preferably thousands, to train an LSTM, and you will only be able to use very small models.
