Simple speech recognition from scratch


Question

The question most similar to mine is this one (simple speech recognition methods), but three years have passed since then and its answers are not sufficient, so I am asking again.

I want to build, from scratch, a simple speech recognition system that only needs to recognize five words. As far as I know, the most commonly used audio features for this application are MFCCs, with HMMs for classification.

I am able to extract the MFCCs from audio, but I still have some doubts about how to use these features to build an HMM model and then perform classification.

As I understand it, I have to perform vector quantization. First I need a set of MFCC vectors, to which I apply a clustering algorithm to obtain centroids. Then I use the centroids to perform vector quantization, which means comparing every MFCC vector against the centroids and labeling it with the name of the most similar one.
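The quantization step described above can be sketched as follows. This is a minimal illustration that assumes plain k-means (Lloyd's algorithm) as the clustering algorithm and squared Euclidean distance as the similarity measure; the function names (`kmeans`, `quantize`) and the toy 2-D vectors are hypothetical, and real MFCC vectors would have more dimensions.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest_centroid(vector, centroids):
    """Index of the centroid closest to `vector`."""
    return min(range(len(centroids)),
               key=lambda k: squared_distance(vector, centroids[k]))

def kmeans(vectors, k, iterations=20, seed=0):
    """Plain k-means; returns k centroids (the codebook)."""
    rng = random.Random(seed)
    # Assumption: initialise the centroids from k randomly chosen vectors.
    centroids = [list(v) for v in rng.sample(vectors, k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in vectors:
            clusters[nearest_centroid(v, centroids)].append(v)
        for idx, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster is empty
                dim = len(members[0])
                centroids[idx] = [sum(m[d] for m in members) / len(members)
                                  for d in range(dim)]
    return centroids

def quantize(frames, centroids):
    """Replace each feature vector by the index of its nearest centroid."""
    return [nearest_centroid(f, centroids) for f in frames]

# Toy stand-ins for MFCC vectors: two obvious groups.
frames = [[0.0, 0.1], [0.2, 0.0], [5.0, 5.1], [5.2, 4.9], [0.1, 0.2]]
codebook = kmeans(frames, k=2)
symbols = quantize(frames, codebook)  # one integer symbol per frame
```

Each utterance then becomes a sequence of integer symbols, one per frame, which is the form a discrete HMM expects.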

The centroids are then the "observable symbols" of the HMM. I have to feed the training words to the training algorithm and create one HMM model per word. Then, given an audio query, I compare it against all the models and report the word whose model gives the highest probability.
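The "compare against all models and pick the most probable word" step can be sketched with the forward algorithm for discrete HMMs. This is only an illustration: the two word models below have hand-set, hypothetical parameters (in practice they would come from training), and `log_likelihood` and `classify` are names chosen for this sketch.

```python
import math

def log_likelihood(observations, start, trans, emit):
    """Scaled forward algorithm: log P(observations | model), discrete HMM."""
    n_states = len(start)
    alpha = [start[i] * emit[i][observations[0]] for i in range(n_states)]
    log_prob = 0.0
    for t in range(len(observations)):
        if t > 0:
            alpha = [
                sum(alpha[i] * trans[i][j] for i in range(n_states))
                * emit[j][observations[t]]
                for j in range(n_states)
            ]
        scale = sum(alpha)
        if scale == 0.0:
            return float("-inf")  # the model cannot produce this sequence
        # Rescale at every step to avoid numerical underflow.
        alpha = [a / scale for a in alpha]
        log_prob += math.log(scale)
    return log_prob

def classify(observations, models):
    """Return the word whose model gives the highest log-likelihood."""
    return max(models,
               key=lambda word: log_likelihood(observations, *models[word]))

# Hypothetical 2-state, 2-symbol models: (start, transitions, emissions).
models = {
    "yes": ([1.0, 0.0], [[0.7, 0.3], [0.0, 1.0]], [[0.9, 0.1], [0.1, 0.9]]),
    "no":  ([1.0, 0.0], [[0.7, 0.3], [0.0, 1.0]], [[0.1, 0.9], [0.9, 0.1]]),
}

print(classify([0, 0, 1, 1], models))
print(classify([1, 0], models))
```

The two queries have different lengths: the forward recursion simply runs for as many symbols as the query contains, so every model is scored on the whole observation sequence.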

First of all, is this procedure correct? Second, how do I deal with words of different lengths? I mean, if I have trained on words of 500 ms and 300 ms, how many observable symbols do I feed in to compare against all the models?

Note: I don't want to use Sphinx, the Android API, the Microsoft API, or any other library.

Note 2: I would appreciate it if you could share more recent information on better techniques.

Recommended answer

First of all, is this procedure correct?

The vector quantization part is OK, but it is rarely used these days. What you describe is a so-called discrete HMM, which nobody uses for speech. If you want continuous HMMs with GMMs as the probability distributions for emissions, you don't need vector quantization.
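To illustrate what a GMM emission looks like, here is a minimal sketch of the log-density that one HMM state would assign to a single feature vector. It assumes diagonal covariances (a common simplification, not something stated above), and all names and parameter values are hypothetical.

```python
import math

def log_gaussian_diag(x, mean, var):
    """Log-density of a Gaussian with diagonal covariance at point x."""
    return -0.5 * sum(
        math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
        for xi, m, v in zip(x, mean, var)
    )

def log_gmm(x, weights, means, variances):
    """Log-density of a Gaussian mixture: log sum_k w_k * N(x; mean_k, var_k)."""
    logs = [math.log(w) + log_gaussian_diag(x, m, v)
            for w, m, v in zip(weights, means, variances)]
    peak = max(logs)
    # Log-sum-exp keeps the sum numerically stable.
    return peak + math.log(sum(math.exp(l - peak) for l in logs))

# Hypothetical 2-component mixture over 2-D features.
weights = [0.5, 0.5]
means = [[0.0, 0.0], [5.0, 5.0]]
variances = [[1.0, 1.0], [1.0, 1.0]]

near = log_gmm([0.1, -0.1], weights, means, variances)
far = log_gmm([2.5, 2.5], weights, means, variances)
print(near > far)
```

In a continuous HMM, this value takes the place of the discrete emission probability, so the feature vector is scored directly and no codebook lookup is needed.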

Also, you focused on less important steps like MFCC extraction but skipped the most important parts, such as HMM training with Baum-Welch and HMM decoding with Viterbi, which are far more complex than the initial estimation of the states with vector quantization.
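As a rough idea of the decoding half, here is a minimal log-domain Viterbi sketch for a discrete HMM. It assumes the model parameters are already known (Baum-Welch training is not shown), and the example parameters are hypothetical.

```python
import math

def safe_log(p):
    """log(p), with log(0) mapped to negative infinity."""
    return math.log(p) if p > 0.0 else float("-inf")

def viterbi(observations, start, trans, emit):
    """Most likely state path and its log-probability for a discrete HMM."""
    n = len(start)
    score = [safe_log(start[i]) + safe_log(emit[i][observations[0]])
             for i in range(n)]
    backpointers = []
    for obs in observations[1:]:
        new_score, pointers = [], []
        for j in range(n):
            # Best predecessor state for arriving in state j.
            best_i = max(range(n),
                         key=lambda i: score[i] + safe_log(trans[i][j]))
            new_score.append(score[best_i] + safe_log(trans[best_i][j])
                             + safe_log(emit[j][obs]))
            pointers.append(best_i)
        score = new_score
        backpointers.append(pointers)
    # Trace the best path backwards from the best final state.
    last = max(range(n), key=lambda i: score[i])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, score[last]

# Hypothetical 2-state left-to-right model with 2 output symbols.
start = [1.0, 0.0]
trans = [[0.7, 0.3], [0.0, 1.0]]
emit = [[0.9, 0.1], [0.1, 0.9]]

path, log_prob = viterbi([0, 0, 1, 1], start, trans, emit)
print(path)
```

The returned path shows which state each frame was aligned to, which is also how frames are assigned to states in Viterbi-style training.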

Then, how do I deal with words of different lengths? I mean, if I have trained on words of 500 ms and 300 ms, how many observable symbols do I feed in to compare against all the models?

When you decode speech, you usually choose the states so that they correspond to parts of the phonemes a human perceives. It is traditional to use 3 states per phoneme. For example, the word "one" should have 9 states for its 3 phonemes, and the word "seven" should have 15 states for its 5 phonemes. This practice has proven effective. Of course, you can vary this estimate slightly.
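The state counts above can be turned into a model skeleton as in the sketch below. It assumes a left-to-right topology in which each state can only repeat or move to the next state, which is a common choice for word models but is not stated above; the self-loop probability of 0.6 is an arbitrary starting value that training would re-estimate.

```python
STATES_PER_PHONEME = 3  # the traditional choice mentioned above

def left_to_right_transitions(n_phonemes, self_loop=0.6):
    """Transition matrix for a left-to-right HMM with 3 states per phoneme."""
    n = n_phonemes * STATES_PER_PHONEME
    trans = [[0.0] * n for _ in range(n)]
    for i in range(n):
        if i + 1 < n:
            trans[i][i] = self_loop            # stay in the same state
            trans[i][i + 1] = 1.0 - self_loop  # advance to the next state
        else:
            trans[i][i] = 1.0                  # final state only repeats
    return trans

one = left_to_right_transitions(3)    # "one": 3 phonemes
seven = left_to_right_transitions(5)  # "seven": 5 phonemes
print(len(one), len(seven))
```

The self-loops are what let the same model absorb utterances of different durations: a longer recording of the same word simply spends more frames in each state, so the number of states depends on the word and not on the length of the audio.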
