How to combine mfcc vector with labels from annotation to pass to a neural network

Problem description

Using librosa, I created mfcc for my audio file as follows:

import librosa

y, sr = librosa.load('myfile.wav')
print(y)
print(sr)
mfcc = librosa.feature.mfcc(y=y, sr=sr)

I also have a text file that contains manual annotations [start, stop, tag] corresponding to the audio as follows:

0.0 2.0 sound1
2.0 4.0 sound2
4.0 6.0 silence
6.0 8.0 sound1

QUESTION: How do I combine the MFCCs generated by librosa with the annotations from the text file?

The final goal is to pair each MFCC with its corresponding label and pass
the pairs to a neural network,
so the network will have the MFCCs and corresponding labels as training data.

If it were one-dimensional, I could have N columns with N values and a final column Y with a class label. But I'm confused about how to proceed, since the MFCC has a shape like (16, X) or (20, Y). So I don't know how to combine the two.

My sample MFCCs are here: https://gist.github.com/manbharae/0a53f8dfef6055feef1d8912044e1418

Please help, thank you.

Update: The objective is to train a neural network so that it can identify a new sound when it encounters one in the future.

I googled and found that MFCCs are very good for speech. However, my audio has speech but I want to identify non-speech. Are there any other recommended audio features for a general-purpose audio classification/recognition task?

Answer

Try the following. The explanation is included in the code.

import numpy
import librosa

# The following function returns a label index for a point in time (tp).
# This is pseudocode for you to complete.
def getLabelIndexForTime(tp):
    # Search the loaded annotations for the label that corresponds to the given time,
    # then convert the label to an index that represents its unique value in the set,
    # i.e. 'sound1' = 0, 'sound2' = 1, ...
    #print(tp)  # for debugging
    label_index = 0  # replace with the logic described above
    return label_index


if __name__ == '__main__':
    # Load the waveform samples and convert them to MFCCs
    raw_samples, sample_rate = librosa.load('Front_Right.wav')
    mfcc = librosa.feature.mfcc(y=raw_samples, sr=sample_rate)
    print('Wave duration is %4.2f seconds' % (len(raw_samples) / float(sample_rate)))

    # Create the network's input training data, X.
    # mfcc is organized (feature, sample) but the net needs (sample, feature),
    # so X is mfcc reorganized to (sample, feature).
    X = numpy.moveaxis(mfcc, 1, 0)
    print('mfcc.shape:', mfcc.shape)
    print('X.shape:   ', X.shape)

    # Note that 512 samples is the default 'hop_length' used in calculating
    # the mfcc, so each mfcc frame spans 512/sample_rate seconds.
    mfcc_samples = mfcc.shape[1]
    mfcc_span = 512 / float(sample_rate)
    print('MFCC calculated duration is %4.2f seconds' % (mfcc_span * mfcc_samples))

    # For the network input samples, calculate the time point where they occur
    # and get the appropriate label index for each.
    # Use +0.5 to get the middle of each mfcc frame's point in time.
    Y = []
    for sample_num in range(mfcc_samples):
        time_point = (sample_num + 0.5) * mfcc_span
        label_index = getLabelIndexForTime(time_point)
        Y.append(label_index)
    Y = numpy.array(Y)

    # Y now contains the network's output training values.
    # Note: for some nets you may need to convert this to one-hot format.
    print('Y.shape:   ', Y.shape)
    assert Y.shape[0] == X.shape[0]  # X and Y have the same number of samples

    # Train the net with something like...
    # model.fit(X, Y, ...)   # e.g. for a Keras NN model
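One way to complete the getLabelIndexForTime placeholder above is to parse the annotation file first and look up each time point in it. A minimal sketch, assuming the whitespace-separated "start stop tag" format shown in the question (the filename 'annotations.txt' and the silence fallback are my assumptions):

```python
# Hypothetical completion of getLabelIndexForTime: parse the annotation file
# into (start, stop, tag) tuples and map each tag to a unique integer index.
annotations = []      # list of (start, stop, tag)
label_to_index = {}   # e.g. {'sound1': 0, 'sound2': 1, 'silence': 2}

def loadAnnotations(path):
    with open(path) as f:
        for line in f:
            parts = line.split()
            if len(parts) != 3:
                continue  # skip blank or malformed lines
            start, stop, tag = float(parts[0]), float(parts[1]), parts[2]
            if tag not in label_to_index:
                label_to_index[tag] = len(label_to_index)
            annotations.append((start, stop, tag))

def getLabelIndexForTime(tp):
    # Return the index of the first annotation interval containing tp.
    for start, stop, tag in annotations:
        if start <= tp < stop:
            return label_to_index[tag]
    # Fallback for time points outside any annotated interval (an assumption;
    # you may prefer to raise an error instead).
    return label_to_index.get('silence', 0)
```

With the four-line annotation file from the question, getLabelIndexForTime(1.0) would map to the sound1 index and getLabelIndexForTime(5.0) to the silence index.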

I should mention that here the Y data is intended for a network with a softmax output that can be trained with integer label data. Keras models accept this with the sparse_categorical_crossentropy loss function (I believe the loss function internally converts the labels to one-hot encoding). Other frameworks require the Y training labels to be delivered already in one-hot encoded format, which is more common. There are lots of examples of how to do the conversion. For your case you could do something like...

Yoh = numpy.zeros(shape=(Y.shape[0], num_label_types), dtype='float32')
for i, val in enumerate(Y):
    Yoh[i, val] = 1.0
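To see what the conversion above produces, here is the same loop run on a tiny hand-made label array (the values are illustrative, not from the question's data):

```python
import numpy

# Small demonstration of the one-hot conversion from the answer:
# each integer label becomes a row with a single 1.0 at that index.
Y = numpy.array([0, 2, 1, 0])
num_label_types = 3

Yoh = numpy.zeros(shape=(Y.shape[0], num_label_types), dtype='float32')
for i, val in enumerate(Y):
    Yoh[i, val] = 1.0

# Each row sums to 1, and argmax along each row recovers the original label.
print(Yoh)
```

Taking numpy.argmax(Yoh, axis=1) gives back the original Y, which is a quick sanity check after the conversion.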

As for MFCCs being acceptable for classifying non-speech, I would expect them to work, but you may want to try modifying their parameters, e.g. librosa allows you to do something like n_mfcc=40 so you get 40 features instead of just 20. For fun, you might try replacing the MFCC with a simple FFT of the same size (512 samples) and see which works best.
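The FFT alternative could be sketched with plain numpy: slice the waveform into 512-sample frames and take the magnitude spectrum of each, giving one feature row per frame just like the transposed MFCC matrix. This is a minimal sketch on a synthetic sine wave (librosa.stft would do the same with windowing and overlap):

```python
import numpy

# Synthetic stand-in for the audio: one second of a 440 Hz tone at 22050 Hz,
# the default librosa sample rate.
sample_rate = 22050
t = numpy.arange(sample_rate) / float(sample_rate)
y = numpy.sin(2 * numpy.pi * 440.0 * t)

# Non-overlapping 512-sample frames, matching the mfcc hop_length of 512.
frame_len, hop = 512, 512
n_frames = 1 + (len(y) - frame_len) // hop
frames = numpy.stack([y[i * hop : i * hop + frame_len] for i in range(n_frames)])

# Magnitude spectrum of each frame; rfft of 512 real samples gives 257 bins,
# so X_fft has shape (n_frames, 257) and can replace X in the training code.
X_fft = numpy.abs(numpy.fft.rfft(frames, axis=1))
print(X_fft.shape)
```

The same getLabelIndexForTime lookup works unchanged, since each FFT frame covers the same 512/sample_rate seconds as an MFCC frame.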
