MFCC Python: completely different result from librosa vs python_speech_features vs tensorflow.signal


Problem Description

I'm trying to extract MFCC features from an audio file (.wav), and I have tried python_speech_features and librosa, but they give completely different results:

    audio, sr = librosa.load(file, sr=None)
    
    # librosa
    hop_length = int(sr/100)
    n_fft = int(sr/40)
features_librosa = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=13, hop_length=hop_length, n_fft=n_fft)
    
    # psf
    features_psf = mfcc(audio, sr, numcep=13, winlen=0.025, winstep=0.01)
    

    Below are the plots:

    librosa:

    python_speech_features:

Did I pass the wrong parameters to either of these methods? Why is there such a huge difference here?

    Update: I have also tried tensorflow.signal implementation, and here's the result:

The plot itself is closer to the one from librosa, but the scale is closer to python_speech_features. (Note that here I calculated 80 mel bins and took the first 13; if I do the calculation with only 13 bins, the result looks quite different as well.) Code below:

    stfts = tf.signal.stft(audio, frame_length=n_fft, frame_step=hop_length, fft_length=512)
    spectrograms = tf.abs(stfts)
    
    num_spectrogram_bins = stfts.shape[-1]
    lower_edge_hertz, upper_edge_hertz, num_mel_bins = 80.0, 7600.0, 80
    linear_to_mel_weight_matrix = tf.signal.linear_to_mel_weight_matrix(
        num_mel_bins, num_spectrogram_bins, sr, lower_edge_hertz, upper_edge_hertz)
    mel_spectrograms = tf.tensordot(spectrograms, linear_to_mel_weight_matrix, 1)
    mel_spectrograms.set_shape(spectrograms.shape[:-1].concatenate(linear_to_mel_weight_matrix.shape[-1:]))
    
    log_mel_spectrograms = tf.math.log(mel_spectrograms + 1e-6)
    features_tf = tf.signal.mfccs_from_log_mel_spectrograms(log_mel_spectrograms)[..., :13]
    features_tf = np.array(features_tf).T
    

    I think my question is: which output is closer to what MFCC actually looks like?

Solution

    There are at least two factors at play here that explain why you get different results:

1. There is no single definition of the mel scale. Librosa implements two of them: Slaney and HTK. Other packages may (and do) use yet other definitions, leading to different results. That being said, the overall picture should be similar, which brings us to the second issue...
2. python_speech_features by default puts the energy as the first (index zero) coefficient (appendEnergy is True by default), meaning that when you ask for e.g. 13 MFCCs, you effectively get 12 MFCCs plus the energy.

In other words, you were not comparing 13 librosa coefficients against 13 python_speech_features coefficients, but rather 13 against 12. The energy can be of a different magnitude and therefore produce a quite different picture due to the different colour scale.
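To see how far the two mel definitions diverge, here is a small self-contained sketch in plain NumPy, using the published HTK and Slaney formulas (the function names are mine, not from either library):

```python
import numpy as np

def hz_to_mel_htk(f):
    # HTK convention: mel = 2595 * log10(1 + f / 700)
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

def hz_to_mel_slaney(f):
    # Slaney convention (what librosa uses with htk=False):
    # linear below 1 kHz, logarithmic above it.
    f = np.asarray(f, dtype=float)
    f_sp = 200.0 / 3.0               # ~66.67 Hz per mel in the linear region
    min_log_hz = 1000.0              # switch point to the log region
    min_log_mel = min_log_hz / f_sp  # = 15.0
    logstep = np.log(6.4) / 27.0
    return np.where(f < min_log_hz,
                    f / f_sp,
                    min_log_mel + np.log(np.maximum(f, min_log_hz) / min_log_hz) / logstep)

freqs = np.array([500.0, 1000.0, 4000.0])
htk = hz_to_mel_htk(freqs)
slaney = hz_to_mel_slaney(freqs)
# Even after normalising each curve to its value at 1 kHz, they disagree,
# so the filter centre frequencies (and hence the MFCCs) must differ too:
print(htk / htk[1])        # roughly [0.61, 1.00, 2.15]
print(slaney / slaney[1])  # roughly [0.50, 1.00, 2.34]
```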

    I will now demonstrate how both modules can produce similar results:

    import librosa
    import python_speech_features
    import matplotlib.pyplot as plt
    from scipy.signal.windows import hann
    import seaborn as sns
    
    n_mfcc = 13
    n_mels = 40
    n_fft = 512 
    hop_length = 160
    fmin = 0
    fmax = None
    sr = 16000
y, sr = librosa.load(librosa.util.example_audio_file(), sr=sr, duration=5, offset=30)
    
    mfcc_librosa = librosa.feature.mfcc(y=y, sr=sr, n_fft=n_fft,
                                        n_mfcc=n_mfcc, n_mels=n_mels,
                                        hop_length=hop_length,
                                        fmin=fmin, fmax=fmax, htk=False)
    
    mfcc_speech = python_speech_features.mfcc(signal=y, samplerate=sr, winlen=n_fft / sr, winstep=hop_length / sr,
                                              numcep=n_mfcc, nfilt=n_mels, nfft=n_fft, lowfreq=fmin, highfreq=fmax,
                                              preemph=0.0, ceplifter=0, appendEnergy=False, winfunc=hann)
    

As you can see, the scale is different, but the overall picture looks really similar. Note that I had to make sure that a number of the parameters passed to the two modules were the same.
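If the goal is to feed these features into a model, the remaining scale difference between implementations can also be neutralised with per-coefficient cepstral mean and variance normalisation (CMVN). A minimal NumPy sketch (the function name is mine; I assume an MFCC matrix of shape (n_mfcc, n_frames), as librosa returns):

```python
import numpy as np

def cmvn(mfcc, eps=1e-10):
    """Per-coefficient mean/variance normalisation over time.

    mfcc: array of shape (n_mfcc, n_frames), as returned by librosa.
    Returns an array of the same shape with each coefficient track
    shifted to zero mean and scaled to unit variance.
    """
    mean = mfcc.mean(axis=1, keepdims=True)
    std = mfcc.std(axis=1, keepdims=True)
    return (mfcc - mean) / (std + eps)

# Demo on random data standing in for e.g. mfcc_librosa above:
rng = np.random.default_rng(0)
fake_mfcc = rng.normal(loc=5.0, scale=3.0, size=(13, 100))
normalised = cmvn(fake_mfcc)
print(normalised.mean(axis=1).round(6))  # each coefficient ~0
print(normalised.std(axis=1).round(6))   # each coefficient ~1
```

Note that python_speech_features returns (n_frames, n_mfcc), so transpose its output first (or normalise over axis=0) before comparing.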
