How to use a context window to segment a whole log Mel-spectrogram (ensuring the same number of segments for all the audios)?


Question

I have several audio files with different durations, so I don't know how to ensure the same number N of segments for each of them. I'm trying to implement an existing paper. It says that first a log Mel-spectrogram is computed over the whole audio with 64 Mel filter banks from 20 to 8000 Hz, using a 25 ms Hamming window and a 10 ms overlap. To get that, I have the following code:

y, sr = librosa.load(audio_file, sr=None)
#sr = 22050
#len(y) = 237142
#duration = 5.377369614512472

n_mels = 64
n_fft = int(np.ceil(0.025*sr)) ## I'm not sure how to complete this parameter
win_length = int(np.ceil(0.025*sr)) # 0.025*22050
hop_length = int(np.ceil(0.010*sr)) #0.010 * 22050
window = 'hamming'

fmin = 20
fmax = 8000

S = librosa.core.stft(y, n_fft=n_fft, hop_length=hop_length, win_length=win_length, window=window, center=False)
M = np.log(librosa.feature.melspectrogram(y=y, sr=sr, S=S, n_mels=n_mels, fmin=fmin, fmax=fmax) + 1e-6)

# M.shape = (64, 532)

(I'm also not sure how to set the n_fft parameter.) The paper then says:

Use a context window of 64 frames to divide the whole log Mel-spectrogram into audio segments with size 64x64. A shift size of 30 frames is used during the segmentation, i.e. two adjacent segments are overlapped with 30 frames. Each divided segment hence has a length of 64 frames and its time duration is 10 ms x (64-1) + 25 ms = 655 ms.

So I'm stuck on this last part: I don't know how to segment M into 64x64 pieces. And how can I get the same number of segments for all the audio files (which have different durations), given that in the end I need 64x64xN features as input to my neural network or classifier? I would greatly appreciate any help! I'm a beginner in audio signal processing.

Recommended answer

Loop over the frames along the time axis, moving forward 30 frames at a time and extracting a window of the last 64 frames. At the start and end you need to either truncate or pad the data so that every window is full.

import math

import librosa
import numpy as np

# Note: example_audio_file() may not be available in recent librosa releases;
# there, librosa.example(...) provides the bundled example recordings instead.
audio_file = librosa.util.example_audio_file()
y, sr = librosa.load(audio_file, sr=None, duration=5.0)  # only load 5 seconds

n_mels = 64
n_fft = int(np.ceil(0.025 * sr))       # 25 ms analysis window
win_length = int(np.ceil(0.025 * sr))
hop_length = int(np.ceil(0.010 * sr))  # 10 ms hop
window = 'hamming'

fmin = 20
fmax = 8000

S = librosa.stft(y, n_fft=n_fft, hop_length=hop_length, win_length=win_length, window=window, center=False)
# pass the power spectrogram (|STFT|^2) to melspectrogram, not the complex STFT
frames = np.log(librosa.feature.melspectrogram(S=np.abs(S) ** 2, sr=sr, n_mels=n_mels, fmin=fmin, fmax=fmax) + 1e-6)

window_size = 64
window_hop = 30

# truncate at start and end so that every window contains full data
# (the alternative would be to zero-pad)
start_frame = window_size
end_frame = window_hop * math.floor(float(frames.shape[1]) / window_hop)

for frame_idx in range(start_frame, end_frame, window_hop):
    # named "segment" so it does not shadow the STFT window type above
    segment = frames[:, frame_idx - window_size:frame_idx]
    assert segment.shape == (n_mels, window_size)
    print('classify window', frame_idx, segment.shape)

This will output

classify window 64 (64, 64)
classify window 94 (64, 64)
classify window 124 (64, 64)
...
classify window 454 (64, 64)
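
If the segments are needed as one array rather than one at a time, the same windowing can be written with NumPy alone. This is only a minimal sketch: `segment_spectrogram` is a hypothetical helper name, and a random array stands in for a real log-Mel spectrogram so the snippet runs without any audio file.

```python
import numpy as np

def segment_spectrogram(spec, window_size=64, window_hop=30):
    """Cut a (n_mels, n_frames) array into full windows along the time axis.

    Returns an array of shape (n_segments, n_mels, window_size); trailing
    frames that do not fill a whole window are dropped.
    """
    n_frames = spec.shape[1]
    if n_frames < window_size:
        raise ValueError("spectrogram is shorter than one window")
    starts = range(0, n_frames - window_size + 1, window_hop)
    return np.stack([spec[:, s:s + window_size] for s in starts])

# Stand-in for a real log-Mel spectrogram: 64 Mel bands x 498 frames.
rng = np.random.default_rng(0)
fake_spec = rng.standard_normal((64, 498))
segments = segment_spectrogram(fake_spec)
print(segments.shape)  # (15, 64, 64)
```

This variant keeps every full window, so it can return one more segment than the loop above, which stops at a multiple of `window_hop`.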

However, the number of windows depends on the length of the audio sample. So if it is important to have the same number of windows for every clip, you need to make sure all audio samples are the same length.
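
One way to get equal-length inputs is to truncate or pad each spectrogram to a fixed number of frames before segmenting. The sketch below is an illustration under assumptions, not the paper's procedure: the helper names, the target length, and the choice of padding with the minimum value are all hypothetical. Padding or trimming the raw audio before computing the spectrogram (for example with `librosa.util.fix_length`) would be an equally reasonable option.

```python
import numpy as np

def fix_num_frames(spec, target_frames, pad_value=None):
    """Truncate or right-pad a (n_mels, n_frames) array to target_frames."""
    n_frames = spec.shape[1]
    if n_frames >= target_frames:
        return spec[:, :target_frames]
    if pad_value is None:
        # Assumption: pad with the quietest value seen in this clip.
        pad_value = spec.min()
    pad = target_frames - n_frames
    return np.pad(spec, ((0, 0), (0, pad)), mode="constant",
                  constant_values=pad_value)

def num_segments(n_frames, window_size=64, window_hop=30):
    """Number of full windows that fit into n_frames."""
    if n_frames < window_size:
        return 0
    return 1 + (n_frames - window_size) // window_hop

# Hypothetical target: 64 + 30 * 9 = 334 frames, i.e. exactly 10 segments.
target_frames = 64 + 30 * 9
short_clip = np.zeros((64, 200))  # stand-ins for log-Mel spectrograms
long_clip = np.zeros((64, 700))
for clip in (short_clip, long_clip):
    fixed = fix_num_frames(clip, target_frames)
    print(fixed.shape, num_segments(fixed.shape[1]))
```

Both clips end up with shape (64, 334) and therefore the same segment count, at the cost of discarding audio from long clips and adding artificial frames to short ones.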
