How to use a context window to segment a whole log Mel-spectrogram (ensuring the same number of segments for all audio files)?

Problem description

I have several audio files of different durations, so I don't know how to ensure the same number N of segments per file. I'm trying to implement an existing paper, which says that a log Mel-spectrogram is first computed over the whole audio with 64 Mel filter banks from 20 to 8000 Hz, using a 25 ms Hamming window and a 10 ms overlap. To get that, I have the following code:

y, sr = librosa.load(audio_file, sr=None)
#sr = 22050
#len(y) = 237142
#duration = 5.377369614512472

n_mels = 64
n_fft = int(np.ceil(0.025*sr)) ## I'm not sure how to complete this parameter
win_length = int(np.ceil(0.025*sr)) # 0.025*22050
hop_length = int(np.ceil(0.010*sr)) #0.010 * 22050
window = 'hamming'

fmin = 20
fmax = 8000

S = librosa.core.stft(y, n_fft=n_fft, hop_length=hop_length, win_length=win_length, window=window, center=False)
M = np.log(librosa.feature.melspectrogram(y=y, sr=sr, S=S, n_mels=n_mels, fmin=fmin, fmax=fmax) + 1e-6)

# M.shape = (64, 532)

(I'm also not sure how to set the n_fft parameter.) The paper then says:

Use a context window of 64 frames to divide the whole log Mel-spectrogram into audio segments with size 64x64. A shift size of 30 frames is used during the segmentation, i.e. two adjacent segments are overlapped with 30 frames. Each divided segment hence has a length of 64 frames and its time duration is 10 ms x (64-1) + 25 ms = 655 ms.

So I'm stuck on this last part: I don't know how to segment M into 64x64 blocks. And how can I get the same number of segments for all audio files (which have different durations), given that in the end I will need 64x64xN features as input to my neural network or classifier? I would greatly appreciate any help! I'm a beginner in audio signal processing.

Recommended answer

Loop over the frames along the time axis, moving forward 30 frames at a time, and extract a window of the last 64 frames. At the start and end you need to either truncate or pad the data to get full windows.

import math

import librosa
import numpy as np

# Note: newer librosa releases removed librosa.util.example_audio_file();
# there, librosa.example('trumpet') serves the same purpose.
audio_file = librosa.util.example_audio_file()
y, sr = librosa.load(audio_file, sr=None, duration=5.0)  # only load 5 seconds

n_mels = 64
n_fft = int(np.ceil(0.025 * sr))       # 25 ms analysis window, in samples
win_length = int(np.ceil(0.025 * sr))
hop_length = int(np.ceil(0.010 * sr))  # 10 ms hop, in samples
window = 'hamming'

fmin = 20
fmax = 8000

S = librosa.stft(y, n_fft=n_fft, hop_length=hop_length, win_length=win_length, window=window, center=False)
# melspectrogram expects a real-valued power spectrogram when S is given,
# so take the squared magnitude of the complex STFT before applying the Mel filters
power = np.abs(S) ** 2
frames = np.log(librosa.feature.melspectrogram(S=power, sr=sr, n_mels=n_mels, fmin=fmin, fmax=fmax) + 1e-6)

window_size = 64
window_hop = 30

# truncate at start and end so that every window contains only real data
# (the alternative would be to zero-pad)
start_frame = window_size
end_frame = window_hop * math.floor(float(frames.shape[1]) / window_hop)

for frame_idx in range(start_frame, end_frame, window_hop):
    # the segment covers the 64 frames that end just before frame_idx
    segment = frames[:, frame_idx - window_size:frame_idx]
    assert segment.shape == (n_mels, window_size)
    print('classify window', frame_idx, segment.shape)

This will output:

classify window 64 (64, 64)
classify window 94 (64, 64)
classify window 124 (64, 64)
...
classify window 454 (64, 64)
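
To keep the windows rather than just print them, the same slicing can collect every segment into one array. The following is a minimal sketch, not the original answer's code: the helper name `segment_frames` is hypothetical, and random data stands in for the log Mel-spectrogram so that it runs without an audio file. Unlike the loop above, it starts at frame 0 and keeps every full window, so it can return more segments than the loop prints.

```python
import numpy as np


def segment_frames(frames, window_size=64, window_hop=30):
    """Split a (n_mels, n_frames) array into stacked full windows.

    Returns an array of shape (n_segments, n_mels, window_size).
    Trailing frames that do not fill a whole window are dropped.
    (Hypothetical helper, for illustration only.)
    """
    n_mels, n_frames = frames.shape
    if n_frames < window_size:
        # Too short for even one window: return an empty stack.
        return np.empty((0, n_mels, window_size), dtype=frames.dtype)
    # Number of full windows that fit when stepping window_hop frames at a time.
    n_segments = 1 + (n_frames - window_size) // window_hop
    starts = window_hop * np.arange(n_segments)
    return np.stack([frames[:, s:s + window_size] for s in starts])


# Random stand-in for a log Mel-spectrogram with 64 bands and 498 frames.
rng = np.random.default_rng(0)
frames = rng.standard_normal((64, 498))
segments = segment_frames(frames)
print(segments.shape)  # (15, 64, 64)
```

The segment count follows from `1 + (n_frames - window_size) // window_hop`, which makes its dependence on the recording length explicit.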

However, the number of windows depends on the length of the audio sample. So if it is important to have the same number of windows for every file, you need to make sure all audio samples have the same length.
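
One way to make the lengths equal is to pad or truncate each spectrogram along the time axis to a fixed frame count before segmenting. The sketch below is an illustration under stated assumptions rather than the paper's procedure: `to_fixed_segments` and the target `n_segments` are hypothetical names, short inputs are padded with the array's minimum value (assumed here to stand in for silence in the log domain), and long inputs are simply cut at the end. It relies on `numpy.lib.stride_tricks.sliding_window_view`, which requires NumPy 1.20 or later.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def to_fixed_segments(frames, n_segments, window_size=64, window_hop=30):
    """Return exactly n_segments windows, shape (n_segments, n_mels, window_size).

    Hypothetical helper: pads or truncates the time axis first.
    """
    # Frames needed so that exactly n_segments full windows fit.
    target = window_size + (n_segments - 1) * window_hop
    n_frames = frames.shape[1]
    if n_frames < target:
        # Pad the end with the minimum value (assumed to represent silence).
        pad = target - n_frames
        frames = np.pad(frames, ((0, 0), (0, pad)),
                        mode='constant', constant_values=frames.min())
    else:
        frames = frames[:, :target]  # truncate the tail
    # Windows at every offset: (n_mels, target - window_size + 1, window_size).
    windows = sliding_window_view(frames, window_size, axis=1)
    # Keep every window_hop-th window and put the segment axis first.
    return windows[:, ::window_hop, :].transpose(1, 0, 2)


# Random stand-ins for two recordings of different duration.
rng = np.random.default_rng(0)
short_clip = rng.standard_normal((64, 300))
long_clip = rng.standard_normal((64, 1200))
a = to_fixed_segments(short_clip, n_segments=15)
b = to_fixed_segments(long_clip, n_segments=15)
print(a.shape, b.shape)  # (15, 64, 64) (15, 64, 64)
```

Stacking the per-file results with `np.stack` then gives a batch of shape (n_files, n_segments, 64, 64). Padding or truncating the raw audio before computing the spectrogram would be an alternative; which choice suits a given classifier is a judgment call that the source does not settle.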
