Additional audio feature extraction tips


Question

I'm trying to create a speech emotion recognition model using Keras. I've written all of the code and trained the model. It sits at around 50% validation accuracy and is overfitting.

When I use model.predict() with unseen data it seems to have a hard time distinguishing between 'neutral', 'calm', 'happy' and 'surprised', but it seems able to predict 'angry' correctly in the majority of cases - I assume because there is a clear difference in pitch or something.

I'm thinking it could possibly be that I'm not extracting enough features from these emotions, which would help the model distinguish between them.

Currently I am using Librosa and converting the audio to MFCCs. Is there any other way, even using Librosa, that I can extract features to help the model better distinguish between 'neutral', 'calm', 'happy', 'surprised' etc.?

Some feature extraction code:

import librosa

wav_clip, sample_rate = librosa.load(file_path, duration=3, mono=True, sr=None)
mfcc = librosa.feature.mfcc(y=wav_clip, sr=sample_rate)

Also, this is with 1400 samples.

Answer

A few things to get you started:

  • Likely you have far too few samples to use neural networks effectively. Use a simple algorithm for starters to understand well how your model is making predictions.
  • Make sure you have enough (30% or more) samples from different speakers put aside for final testing. You can use this test set only once, so think about building a pipeline to generate train, validation and test sets (see the sketch after this list). Make sure you don't put the same speaker into more than one set.
  • AFAIK, the first coefficient from librosa gives you an offset. I'd recommend plotting how your features correlate with the labels and how far they overlap; some can be easily confused, I guess. Find out whether there are any features that would differentiate your classes. Don't do this by running your model - do a visual inspection first.
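
A minimal sketch of the speaker-aware split from the second bullet, using scikit-learn's GroupShuffleSplit (this is not the answer's own code; the features, labels and speakers arrays are assumed, one entry per clip, with speakers holding a speaker id):

import numpy as np
from sklearn.model_selection import GroupShuffleSplit

def speaker_holdout(features, labels, speakers, test_size=0.3, seed=0):
    """Hold out ~30% of the clips so that no speaker ends up in both sets."""
    features, labels = np.asarray(features), np.asarray(labels)
    splitter = GroupShuffleSplit(n_splits=1, test_size=test_size, random_state=seed)
    train_idx, test_idx = next(splitter.split(features, labels, groups=speakers))
    return (features[train_idx], labels[train_idx],
            features[test_idx], labels[test_idx])

Running the same idea twice (first to carve out the test speakers, then to split the remainder into train and validation) gives the three-way pipeline the bullet describes.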

On to the actual features! You're right to assume that pitch should play a vital role. I'd recommend checking out aubio - it has Python bindings.
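
As an illustration of the pitch idea, a minimal sketch following aubio's standard pitch-tracking pattern (this is not the answer's own code, and file_path, win_s and hop_s are assumed parameters):

import numpy as np
from aubio import source, pitch

def pitch_contour(file_path, win_s=2048, hop_s=512):
    """Return a per-frame pitch estimate in Hz using aubio's YIN method."""
    s = source(file_path, 0, hop_s)            # 0 = keep the file's own sample rate
    pitch_o = pitch("yin", win_s, hop_s, s.samplerate)
    pitch_o.set_unit("Hz")
    pitch_o.set_tolerance(0.8)

    pitches = []
    while True:
        samples, read = s()
        pitches.append(pitch_o(samples)[0])    # pitch estimate for this frame
        if read < hop_s:
            break
    return np.array(pitches)

Summary statistics of this contour (mean, range, variance) could then be appended to the feature vector alongside the MFCCs.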

Yaafe also offers an excellent selection of features.

You might easily end up with 150+ features. You might want to reduce the dimensionality of the problem, perhaps even compress it down to 2D and see whether you can somehow separate the classes. Here is my own example with Dash.
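
A hypothetical sketch of that 2D check using scikit-learn's t-SNE instead of the Dash example linked above (X is an assumed (n_samples, n_features) matrix and y the emotion labels):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.manifold import TSNE

def plot_2d_embedding(X, y):
    """Project the feature matrix to 2D and colour each point by its emotion label."""
    X_scaled = StandardScaler().fit_transform(X)
    emb = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X_scaled)
    y = np.asarray(y)
    for label in np.unique(y):
        mask = y == label
        plt.scatter(emb[mask, 0], emb[mask, 1], label=str(label), s=10)
    plt.legend()
    plt.show()

If 'angry' forms its own cluster while 'neutral', 'calm' and 'happy' overlap, that mirrors what the model is already telling you.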

Last but not least, some basic code to extract frequencies from the audio. In this case I am also trying to find the three peak frequencies.

import numpy as np
from scipy import signal

def spectral_statistics(y: np.ndarray, fs: int, lowcut: int = 0) -> dict:
    """
    Compute selected statistical properties of spectrum
    :param y: 1-d signal
    :param fs: sampling frequency [Hz]
    :param lowcut: lowest frequency [Hz]
    :return: spectral features (dict)
    """
    spec = np.abs(np.fft.rfft(y))
    freq = np.fft.rfftfreq(len(y), d=1 / fs)
    idx = int(lowcut / fs * len(freq) * 2)
    spec = np.abs(spec[idx:])
    freq = freq[idx:]

    amp = spec / spec.sum()
    mean = (freq * amp).sum()
    sd = np.sqrt(np.sum(amp * ((freq - mean) ** 2)))
    amp_cumsum = np.cumsum(amp)
    median = freq[len(amp_cumsum[amp_cumsum <= 0.5]) + 1]
    mode = freq[amp.argmax()]
    Q25 = freq[len(amp_cumsum[amp_cumsum <= 0.25]) + 1]
    Q75 = freq[len(amp_cumsum[amp_cumsum <= 0.75]) + 1]
    IQR = Q75 - Q25
    z = amp - amp.mean()
    w = amp.std()
    skew = ((z ** 3).sum() / (len(spec) - 1)) / w ** 3
    kurt = ((z ** 4).sum() / (len(spec) - 1)) / w ** 4

    top_peaks_ordered_by_power = {'stat_freq_peak_by_power_1': 0, 'stat_freq_peak_by_power_2': 0, 'stat_freq_peak_by_power_3': 0}
    top_peaks_ordered_by_order = {'stat_freq_peak_by_order_1': 0, 'stat_freq_peak_by_order_2': 0, 'stat_freq_peak_by_order_3': 0}
    amp_smooth = signal.medfilt(amp, kernel_size=15)
    peaks, height_d = signal.find_peaks(amp_smooth, distance=100, height=0.002)
    if peaks.size != 0:
        peak_f = freq[peaks]
        for peak, peak_name in zip(peak_f, top_peaks_ordered_by_order.keys()):
            top_peaks_ordered_by_order[peak_name] = peak

        idx_three_top_peaks = height_d['peak_heights'].argsort()[-3:][::-1]
        top_3_freq = peak_f[idx_three_top_peaks]
        for peak, peak_name in zip(top_3_freq, top_peaks_ordered_by_power.keys()):
            top_peaks_ordered_by_power[peak_name] = peak

    specprops = {
        'stat_mean': mean,
        'stat_sd': sd,
        'stat_median': median,
        'stat_mode': mode,
        'stat_Q25': Q25,
        'stat_Q75': Q75,
        'stat_IQR': IQR,
        'stat_skew': skew,
        'stat_kurt': kurt
    }
    specprops.update(top_peaks_ordered_by_power)
    specprops.update(top_peaks_ordered_by_order)
    return specprops
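
Hypothetical usage, combining the question's librosa loading with the function above (file_path is assumed, and lowcut=50 is just an illustrative choice):

import librosa

wav_clip, sample_rate = librosa.load(file_path, duration=3, mono=True, sr=None)
props = spectral_statistics(wav_clip, fs=int(sample_rate), lowcut=50)
print(props['stat_mode'], props['stat_freq_peak_by_power_1'])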

