Simple speech recognition from scratch
Problem Description
The question most similar to mine is this one (simple speech recognition methods), but since three years have passed and the answers there are not sufficient, I am asking again.
I want to build a simple speech recognition system from scratch; I only need to recognize five words. As far as I know, the most commonly used audio features for this application are MFCCs, with HMMs for classification.
I am able to extract MFCCs from audio, but I still have some doubts about how to use these features to train an HMM model and then perform classification.
As I understand it, I have to perform vector quantization. First I need a set of MFCC vectors, then I apply a clustering algorithm to obtain centroids. The centroids are then used for vector quantization: each MFCC vector is compared against the centroids and labeled with the index of the closest one.
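The clustering and quantization steps described above can be sketched with plain NumPy. This is a minimal k-means, assuming the MFCC frames are already stacked into an (n_frames, n_coeffs) array; the frame count, codebook size, and the random data in the demo are purely illustrative:

```python
import numpy as np

def kmeans(vectors, k, iters=50, seed=0):
    """Plain k-means: returns a (k, dim) array of centroids."""
    rng = np.random.default_rng(seed)
    # initialize centroids from k distinct random frames
    centroids = vectors[rng.choice(len(vectors), size=k, replace=False)]
    for _ in range(iters):
        # assign each vector to its nearest centroid (Euclidean distance)
        dists = np.linalg.norm(vectors[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = vectors[labels == j]
            if len(members):               # keep the old centroid if a cluster empties
                centroids[j] = members.mean(axis=0)
    return centroids

def quantize(vectors, centroids):
    """Vector quantization: map each MFCC frame to its closest centroid's index."""
    dists = np.linalg.norm(vectors[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)            # one discrete symbol per frame

# toy demo with random stand-ins for MFCC frames (13 coefficients each)
frames = np.random.default_rng(1).normal(size=(200, 13))
codebook = kmeans(frames, k=8)
symbols = quantize(frames, codebook)       # discrete observation sequence for the HMM
```

The `symbols` array is the discrete observation sequence that a discrete HMM would consume.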
The centroids then become the 'observable symbols' of the HMM. I feed training words to the training algorithm and create one HMM model per word. Given an audio query, I compare it against all the models and answer with the word whose model gives the highest probability.
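The scoring step just described can be sketched as follows: a log-space forward algorithm computes log P(obs | model) for each word's discrete HMM, and the word with the highest likelihood wins. The two-word vocabulary and all probability values below are toy illustrations, not trained models:

```python
import numpy as np

def logsumexp(x, axis=None):
    m = np.max(x, axis=axis, keepdims=True)
    return np.log(np.sum(np.exp(x - m), axis=axis)) + np.squeeze(m, axis=axis)

def forward_log_likelihood(obs, pi, A, B):
    """log P(obs | model) via the forward algorithm in log space.
    obs: symbol indices; pi: (S,) initial probs; A: (S, S) transitions;
    B: (S, K) emission probs. Assumes nonzero probabilities (floor zeros first)."""
    logA, logB = np.log(A), np.log(B)
    alpha = np.log(pi) + logB[:, obs[0]]
    for t in range(1, len(obs)):
        # sum over previous states in log space, then emit the next symbol
        alpha = logsumexp(alpha[:, None] + logA, axis=0) + logB[:, obs[t]]
    return float(logsumexp(alpha))

def classify(obs, word_models):
    """Score obs against every word's HMM and return the best-scoring word."""
    return max(word_models, key=lambda w: forward_log_likelihood(obs, *word_models[w]))

# toy demo: two 2-state models over a 2-symbol codebook
pi = np.array([0.5, 0.5])
A = np.array([[0.8, 0.2], [0.2, 0.8]])
models = {
    "yes": (pi, A, np.array([[0.9, 0.1], [0.9, 0.1]])),  # mostly emits symbol 0
    "no":  (pi, A, np.array([[0.1, 0.9], [0.1, 0.9]])),  # mostly emits symbol 1
}
obs = np.array([0, 0, 0, 0])
print(classify(obs, models))   # -> yes
```

Note that the forward algorithm accepts observation sequences of any length, which is relevant to the variable-word-length question below.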
First of all, is this procedure correct? Second, how do I deal with words of different lengths? I mean, if I have trained on words of 500 ms and 300 ms, how many observable symbols do I feed in when comparing against all the models?
Note: I don't want to use Sphinx, the Android API, the Microsoft API, or any other library.
Note 2: I would appreciate any pointers to more recent, better techniques.
Answer
First of all, is this procedure correct?
The vector quantization part is OK, but it is rarely used these days. What you describe is a so-called discrete HMM, which nobody uses for speech anymore. If you use continuous HMMs with GMMs as the emission probability distributions, you do not need vector quantization at all.
You have also focused on the less important steps, such as MFCC extraction, while skipping the most important parts: HMM training with Baum-Welch and HMM decoding with Viterbi. These are far more complex parts of the training than the initial estimation of the states with vector quantization.
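As a sketch of the decoding side mentioned here, the following is a minimal Viterbi implementation for a discrete HMM in log space. The two-state model at the bottom is a toy illustration, not a speech model:

```python
import numpy as np

def viterbi(obs, pi, A, B):
    """Most likely state path for a discrete HMM, computed in log space.
    obs: symbol indices; pi: (S,) initial probs; A: (S, S) transitions;
    B: (S, K) emissions. Assumes nonzero probabilities. Returns (path, log_prob)."""
    S, T = len(pi), len(obs)
    logA, logB = np.log(A), np.log(B)
    delta = np.log(pi) + logB[:, obs[0]]    # best score ending in each state
    psi = np.zeros((T, S), dtype=int)       # backpointers
    for t in range(1, T):
        scores = delta[:, None] + logA      # scores[i, j]: come from i, land in j
        psi[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + logB[:, obs[t]]
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):           # follow backpointers from the end
        path.append(int(psi[t][path[-1]]))
    return path[::-1], float(delta.max())

# toy demo: state 0 mostly emits symbol 0, state 1 mostly emits symbol 1
pi = np.array([0.9, 0.1])
A = np.array([[0.7, 0.3], [0.3, 0.7]])
B = np.array([[0.95, 0.05], [0.05, 0.95]])
path, logp = viterbi([0, 0, 1, 1], pi, A, B)
print(path)   # -> [0, 0, 1, 1]
```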
Then, how do I deal with words of different lengths? I mean, if I have trained on words of 500 ms and 300 ms, how many observable symbols do I feed in when comparing against all the models?
When decoding speech, you usually select states that correspond to the sub-phoneme units perceived by humans. It is traditional to use 3 states per phoneme. For example, the word "one" (3 phonemes) should have 9 states, and the word "seven" (5 phonemes) should have 15 states. This practice has proven effective, though you can vary the estimate slightly. The length of the observation sequence itself does not need to be fixed: the forward algorithm yields a likelihood for a sequence of any length, so a 500 ms word simply produces more frames than a 300 ms one.
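This state-allocation rule can be sketched as a left-to-right (Bakis) topology, the usual structure for word models: each state either loops on itself or advances to the next, which is also how a single model absorbs 500 ms and 300 ms utterances of the same word. The 0.5 probabilities below are placeholders that Baum-Welch training would re-estimate:

```python
import numpy as np

def left_to_right_hmm(n_phonemes, states_per_phoneme=3):
    """Build a left-to-right (Bakis) transition matrix with 3 states
    per phoneme: each state may only stay put or advance by one."""
    n = n_phonemes * states_per_phoneme
    A = np.zeros((n, n))
    for i in range(n - 1):
        A[i, i] = 0.5        # self-loop absorbs variable word duration
        A[i, i + 1] = 0.5    # advance to the next sub-phoneme state
    A[n - 1, n - 1] = 1.0    # final state
    return A

A_one = left_to_right_hmm(3)      # "one": 3 phonemes -> 9 states
A_seven = left_to_right_hmm(5)    # "seven": 5 phonemes -> 15 states
```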